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Abstract 

We present a canonical way to turn any smooth parametric fam- 
ily of probability distributions on an arbitrary search space X into a 
continuous-time black-box optimization method on X, the information- 
geometric optimization (IGO) method. Invariance as a major design 
principle keeps the number of arbitrary choices to a minimum. The re- 
sulting method conducts a natural gradient ascent using an adaptive, 
time-dependent transformation of the objective function, and makes 
no particular assumptions on the objective function to be optimized. 

The IGO method produces explicit IGO algorithms through time 
discretization. The cross-entropy method is recovered in a particular 
case with a large time step, and can be extended into a smoothed, 
parametrization-independent maximum likelihood update. 

When applied to specific families of distributions on discrete or con- 
tinuous spaces, the IGO framework allows to naturally recover versions 
of known algorithms. From the family of Gaussian distributions on M'', 
we arrive at a version of the well-known CMA-ES algorithm. From the 
family of Bernoulli distributions on {0, 1}'*, we recover the seminal 
PBIL algorithm. From the distributions of restricted Boltzmann ma- 
chines, we naturally obtain a novel algorithm for discrete optimization 
on {0,1}''. 

The IGO method achieves, thanks to its intrinsic formulation, max- 
imal invariance properties: invariance under reparametrization of the 
search space X, under a change of parameters of the probability dis- 
tribution, and under increasing transformation of the function to be 
optimized. The latter is achieved thanks to an adaptive formulation of 
the objective. 

Theoretical considerations strongly suggest that IGO algorithms 
are characterized by a minimal change of the distribution. Therefore 
they have minimal loss in diversity through the course of optimization, 
provided the initial diversity is high. First experiments using restricted 
Boltzmann machines confirm this insight. As a simple consequence, 
IGO seems to provide, from information theory, an elegant way to 
spontaneously explore several valleys of a fitness landscape in a single 
run. 
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Introduction 

In this article, we consider an objective function / : X — )■ M to be minimized 



over a search space X. No particular assumption on X is needed: it may 
be discrete or continuous, finite or infinite. We adopt a standard scenario 
where we consider / as a black box that delivers values f{x) for any desired 
input X £ X. The objective of black-box optimization is to find solutions 
X € X with small value f{x), using the least number of calls to the black 
box. In this context, we design a stochastic optimization method from sound 
theoretical principles. 

We assume that we are given a family of probability distributions Pg 
on X depending on a continuous multicomponent parameter 9 £ Q. A 
basic example is to take X = and to consider the family of all Gaussian 
distributions Pg on R"', with 9 = (m, C) the mean and covariance matrix. 
Another simple example is X = {0,1}'^ equipped with the family of Bernoulli 
measures, i.e. 9 = {9i)i<^i^d and Pe{x) = Jl^i^'ll - for x = (xj) e X. 

The parameters 9 of the family Pg belong to a space, O, assumed to be a 
smooth manifold. 
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Prom this setting, we build in a natural way an optimization method, 
the information- geometric optimization (IGO). At each time t, we maintain 
a value 6* such that Pgt represents, loosely speaking, the current belief about 
where the smallest values of the function / may lie. Over time, P^t evolves 
and is expected to concentrate around the minima of /. This general ap- 
proach resembles the wide family of estimation of distribution algorithms 
(EDA) [LL02, BC95, PGL02]. However, we deviate somewhat from the 
common EDA reasoning, as explained in the following. 

The IGO method takes the form of a gradient ascent on ^* in the pa- 
rameter space G. We follow the gradient of a suitable transformation of /, 
based on the Pgi-quantiles of /. The gradient used for 9 is the natural gradi- 
ent defined from the Fisher information metric [Rao45, Jcf46, ANOO], as is 
the case in various other optimization strategies, for instance so-called nat- 
ural evolution strategies [WSPS08, SWSS09, GSS+10]. Thus, we extend the 
scope of optimization strategies based on this gradient to arbitrary search 
spaces. 

The IGO method also has an equivalent description as an infinitesimal 
maximum likelihood update; this reveals a new property of the natural gra- 
dient. This also relates IGO to the cross- entropy method [dBKMROS] in 
some situations. 

When we instantiate IGO using the family of Gaussian distributions 
on we naturally obtain variants of the well-known covariance matrix 
adaptation evolution strategy (CMA-ES) [HOOl, HK04, JAOG] and of natural 
evolution strategies. With Bernoulli measures on the discrete cube {0,1}'^, 
we recover the well-known population based incremental learning (PBIL) 
[BC95, Bal94]; this derivation of PBIL as a natural gradient ascent appears 
to be new, and sheds some light on the common ground between continuous 
and discrete optimization. 

From the IGO framework, it is immediate to build new optimization 
algorithms using more complex families of distributions than Gaussian or 
Bernoulli. As an illustration, distributions associated with restricted Boltz- 
mann machines (RBMs) provide a new but natural algorithm for discrete 
optimization on {0, l}"^, able to handle dependencies between the bits (see 
also [Ber02]). The probability distributions associated with RBMs are multi- 
modal; combined with specific information-theoretic properties of IGO that 
guarantee minimal loss of diversity over time, this allows IGO to reach mul- 
tiple optima at once very naturally, at least in a simple experimental setup. 

Our method is built to achieve maximal invariance properties. First, it 
will be invariant under reparametrization of the family of distributions Pg, 
that is, at least for infinitesimally small steps, it only depends on Pg and not 
on the way we write the parameter 9. (For instance, for Gaussian measures 
it should not matter whether we use the covariance matrix or its inverse or 
a Cholesky factor as the parameter.) This limits the influence of encoding 
choices on the behavior of the algorithm. Second, it will be invariant under 
a change of coordinates in the search space X, provided that this change of 
coordinates globally preserves our family of distributions Pg . (For Gaussian 
distributions on W^, this includes all affine changes of coordinates.) This 
means that the algorithm, apart from initialization, does not depend on 
the precise way the data is presented. Last, the algorithm will be invariant 
under applying an increasing function to /, so that it is indifferent whether 
we minimize e.g. /, or / x |/|~^'^^. This way some non-convex or non- 
smooth functions can be as "easily" optimised as convex ones. Contrary to 
previous formulations using natural gradients [WSPS08, GSS'^IO, ANOKIO], 
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this invariance under increasing transformation of the objective function is 
achieved from the start. 

Invariance under X-reparametrization has been — we beheve — one of the 
keys to the success of the CMA-ES algorithm, which derives from a par- 
ticular case of ours. Invariance under 0-reparametrization is the main idea 
behind information geometry [v\NOO]. Invariance under /-transformation 
is not uncommon, e.g., for evolution strategies [Sch95] or pattern search 
methods [H.J61, Tor97, NM65]; however it is not always recognized as an 
attractive feature. Such invariance properties mean that we deal with in- 
trinsic properties of the objects themselves, and not with the way we encode 
them as collections of numbers in W^. It also means, most importantly, that 
we make a minimal number of arbitrary choices. 

In Section 1, we define the IGO flow and the IGO algorithm. We begin 
with standard facts about the definition and basic properties of the natural 
gradient, and its connection with Kullback-Leibler divergence and diversity. 
We then proceed to the detailed description of our algorithm. 

In Section 2, we state some first mathematical properties of IGO. These 
include monotone improvement of the objective function, invariance prop- 
erties, the form of IGO for exponential families of probability distributions, 
and the case of noisy objective functions. 

In Section 3 we explain the theoretical relationships between IGO, max- 
imum likelihood estimates and the cross-entropy method. In particular, 
IGO is uniquely characterized by a weighted log-likelihood maximization 
property. 

In Section 4, we derive several well-known optimization algorithms from 
the IGO framework. These include PBIL, versions of CMA-ES and other 
Gaussian evolutionary algorithms such as EMNA. This also illustrates how 
a large step size results in more and more differing algorithms w.r.t. the 
continuous-time IGO fiow. 

In Section 5, we illustrate how IGO can be used to design new optimiza- 
tion algorithms. As a proof of concept, we derive the IGO algorithm associ- 
ated with restricted Boltzmann machines for discrete optimization, allowing 
for multimodal optimization. We perform a preliminary experimental study 
of the specific influence of the Fisher information matrix on the performance 
of the algorithm and on diversity of the optima obtained. 

In Section 6, we discuss related work, and in particular, IGO's relation- 
ship with and differences from various other optimization algorithms such 
as natural evolution strategies or the cross-entropy method. We also sum 
up the main contributions of the paper and the design philosophy of IGO. 

1 Algorithm description 

We now present the outline of our algorithms. Each step is described in 
more detail in the sections below. 

Our method can be seen as an estimation of distribution algorithm: at 
each time t, we maintain a probability distribution Pgt on the search space 
X, where 0* S Q. The value of 0* will evolve so that, over time, Pgt gives 
more weight to points x with better values of the function f{x) to optimize. 

A straightforward way to proceed is to transfer / from x-space to 9- 
space: define a function F{6) as the P^i-average of / and then to do a 
gradient descent for F{6) in space [BcrOO]. This way, 9 will converge to 
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a point such that Pg yields a good average value of /. We depart from this 
approach in two ways: 

• At each time, we replace / with an adaptive transformation of / rep- 
resenting how good or bad observed values of / are relative to other 
observations. This provides invariance under all monotone transfor- 
mations of /. 

• Instead of the vanilla gradient for 9, we use the so-called natural gradi- 
ent given by the Fisher information matrix. This reflects the intrinsic 
geometry of the space of probability distributions, as introduced by 
Rao and Jeffreys [Rao45, .Jcf4G] and later elaborated upon by Amari 
and others [ANOO]. This provides invariance under reparametrization 
of and, importantly, minimizes the change of diversity of Pg. 

The algorithm is constructed in two steps: we first give an "ideal" version, 
namely, a version in which time t is continuous so that the evolution of 9^ is 
given by an ordinary differential equation in Q. Second, the actual algorithm 
is a time discretization using a finite time step and Monte Carlo sampling 
instead of exact Pg-averages. 

1.1 The natural gradient on parameter space 

About gradients and the shortest path uphill. Let 5 be a smooth 
function from M'^ to M, to be maximized. We first present the interpretation 
of gradient ascent as "the shortest path uphill". 
Let y E M'^. Define the vector z by 

z = limaig max g{y + ez). (1) 

^^'^ z, \\z\\^l 

Then one can check that z is the normalized gradient of g at y: Zi — ||9g/9yfc || • 
(This holds only at points y where the gradient of g does not vanish.) 

This shows that, for small 6t, the well-known gradient ascent of g given 

by 

„.t+St _ t A_ Xf dg_ 

realizes the largest increase in the value of g, for a given step size Hy*''''^* ~y*ll- 
The relation (1) depends on the choice of a norm || • || (the gradient of 
g is given by dg/dyi only in an orthonormal basis). If we use, instead of 

the standard metric \\y — y'\\ = ^Jj^iUi ~ u'iY ^'^^ a metric \\y — y'\\A = 

■\JYj -^ijiVi ~ y'diVj ~ y'j) defined by a positive definite matrix Aij, then the 

gradient of g with respect to this metric is given by J2j ^ij^§y'- (This follows 
from the textbook definition of gradients by g{y + ez) = g{y) + e(Vg, z)a + 
O(e^) with (•, ■)a the scalar product associated with the matrix Aij [Scli92].) 

We can write the analogue of (1) using the ^-norm. We get that the 
gradient ascent associated with metric ^, given by 

y'+'' =y' + 5tA-^ll-, 

for small 5t, maximizes the increment of g for a given A-distance ||y*'^''* — 
— it realizes the steepest ^-ascent. Maybe this viewpoint clarifies the 
relationship between gradient and metric: this steepest ascent property can 
actually be used as a definition of gradients. 

In our setting we want to use a gradient ascent in the parameter space 
of our distributions Pg. The metric ||^ — = \J^{9i — 9[Y clearly depends 
on the choice of parametrization 9, and thus is not intrinsic. Therefore, we 
use a metric depending on 9 only through the distributions Pg, as follows. 
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Fisher information and the natural gradient on parameter space. 

Let 6^9' £ Q be two values of the distribution parameter. The Kullback- 
Leibler divergence between Pq and Pgi is defined [Ivul97] as 

Pe'ix) 



KL{Pg,\\Pe)= I \n-Pf^Pe.{dx). 
Jx Peix) 



When 6' = 9 + 69 is close to 9, under mild smoothness assumptions we 
can expand the Kullback-Leibler divergence at second order in 69. This 
expansion defines the Fisher information matrix I at 9 [Kul97]: 

KL{Pe+5e \\ Pe) = \j2 ^^i^^i + 0{S9^)- 

An equivalent definition of the Fisher information matrix is by the usual 
formulas [C'T(J6] 

''^^^^ - Jx d9 — --Jx 89.89, 

The Fisher information matrix defines a (Riemannian) metric on 0: the 
distance, in this metric, between two very close values of 9 is given by the 
square root of twice the Kullback-Leibler divergence. Since the Kullback- 
Leibler divergence depends only on Pq and not on the parametrization of 9, 
this metric is intrinsic. 

If 5 : G — ^ M is a smooth function on the parameter space, its natural 
gradient at 9 is defined in accordance with the Fisher metric as 



or more synthetically 



From now on, we will use to denote the natural gradient and ^ to denote 
the vanilla gradient. 

By construction, the natural gradient descent is intrinsic: it does not 
depend on the chosen parametrization 9 of Pq^ so that it makes sense to 
speak of the natural gradient ascent of a function g{Pg)- 

Given that the Fisher metric comes from the Kullback-Leibler diver- 
gence, the "shortest path uphill" property of gradients mentioned above 
translates as follows (see also [Ama98, Theorem 1]): 

Proposition 1. The natural gradient ascent points in the direction 69 achiev- 
ing the largest change of the objective function, for a given distance between 
Pq and Pe+se Kullback-Leibler divergence. More precisely, let g be a 
smooth function on the parameter space Q. Let 9 £ Q be a point where 
Vg{9) does not vanish. Then 

^ — lim — argmax g{9 + 69). 



II Vo(6') II e 59 such that 

mPe+se\\PoKe^/2 

Here we have implicitly assumed that the parameter space is non- 
degenerate and proper (that is, no two points G define the same proba- 
bility distribution, and the mapping Pq ^ 9 \s continuous). 
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Why use the Fisher metric gradient for optimization? Relation- 
ship to diversity. The first reason for using the natural gradient is its 
reparametrization invariance, which makes it the only gradient available in 
a general abstract setting [ANOO]. Practically, this invariance also limits 
the influence of encoding choices on the behavior of the algorithm. More 
prosaically, the Fisher matrix can be also seen as an adaptive learning rate 
for different components of the parameter vector Of. components i with a 
high impact on Pg will be updated more cautiously. 

Another advantage comes from the relationship with Kullback-Leibler 
distance in view of the "shortest path uphill" (see also [Ama98]). To mini- 
mize the value of some function g{6) defined on the parameter space 0, the 
naive approach follows a gradient descent for g using the "vanilla" gradient 

0i^''=0l + 6t§- 

and, as explained above, this maximizes the increment of g for a given 
increment — 9^\\. On the other hand, the Fisher gradient 

9j'-''=9l + 5t 

maximizes the increment of g for a given Kullback-Leibler distance KL(Pgt+« || Pet). 

In particular, if we choose an initial value 9^ such that Pgo covers a wide 
portion of the space X uniformly, the Kullback-Leibler divergence between 
Pgt and Pgo measures the loss of diversity of Pgt . So the natural gradient 
descent is a way to optimize the function g with minimal loss of diversity, 
provided the initial diversity is large. On the other hand the vanilla gradient 
descent optimizes g with minimal change in the numerical values of the 
parameter 9, which is of little interest. 

So arguably this method realizes the best trade-off between optimization 
and loss of diversity. (Though, as can be seen from the detailed algorithm 
description below, maximization of diversity occurs only greedily at each 
step, and so there is no guarantee that after a given time, IGO will provide 
the highest possible diversity for a given objective function value.) 

An experimental confirmation of the positive infiuence of the Fisher ma- 
trix on diversity is given in Section 5 below. This may also provide a theo- 
retical explanation to the good performance of CMA-ES. 

1.2 IGO: Information-geometric optimization 

Quantile rewriting of /. Our original problem is to minimize a function 
/ : X — > M. A simple way to turn / into a function on Q is to use the 
expected value —Kp^f [BcrOO, WSPS08], but expected values can be unduly 
infiuenced by extreme values and using them is rather unstable [Whi89]; 
moreover — Ep^/ is not invariant under increasing transformation of / (this 
invariance implies we can only compare /-values, not add them). 

Instead, we take an adaptive, quantile-based approach by first replac- 
ing the function / with a monotone rewriting Wg and then following the 
gradient of KpgWg . A due choice of Wg allows to control the range of the 
resulting values and achieves the desired invariance. Because the rewriting 
Wq depends on 9, it might be viewed as an adaptive /-transformation. 

The goal is that if f{x) is "small" then w/(x) G R is "large" and vice 
versa, and that Wg remains invariant under monotone transformations of 
/. The meaning of "small" or "large" depends on 9 £ @ and is taken with 
respect to typical values of / under the current distribution Pg. This is 
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measured by the Pg-quantile in which the value of f{x) lies. We write the 
lower and upper Pe-/-quantiles of x G X as 

g,-(x)=Pr,,,^P,(/(x')</(x)) 

g+(x)=Pr,,.p//(x')^/(x)) . ^ ' 

These quantile functions reflect the probability to sample a better value than 
f{x). They are monotone in / (if f{xi) ^ /(X2) then q^{xi) ^ q^{x2)) and 
invariant under increasing transformations of /. 

Given q £ [0; 1], we now choose a non-increasing function w : [0; 1] — )■ M 
(fixed once and for all). A typical choice for w is 'w{q) = Igs^q^ for some 
fixed value qo, the selection quantile. The transform Wg (x) is defined as a 
function of the Pg-Z-quantile of x as 

, {wiqoix)) q^ix) = qgix), 

W/(x) = { . rq=q^{^) , . (3) 
I , , — \ ®, , w(q)Aq otherwise. 
I %{^)-% W 1=% (^0 

As desired, the definition of W/ is invariant under an increasing transforma- 
tion of /. For instance, the Pg-median of / gets remapped to w(}^. 

f 1 

Note that Ep.VFg^ = /g" w is independent of / and 9: indeed, by definition, 
the quantile of a random point under Pg is uniformly distributed in [0; 1]. In 
the following, our objective will be to maximize the expected value of Wgt 
in 9, that is, to maximize 

^PeWit = J Wg{{x)Pe{dx) (4) 

over 9, where 9^ is fixed at a given step but will adapt over time. 

Importantly, w/(x) can be estimated in practice: indeed, the quantiles 
Pix'r^Pg (/(x') < /(x)) can be estimated by taking samples of Pg and ordering 
the samples according to the value of / (see below). The estimate remains 
invariant under increasing /-transformations. 



The IGO gradient flow. At the most abstract level, IGO is a continuous- 
time gradient flow in the parameter space 0, which we deflne now. In 
practice, discrete time steps (a.k.a. iterations) are used, and Pg-integrals are 
approximated through sampling, as described in the next section. 

Let 9^ be the current value of the parameter at time t, and let 6t <^ 1. 
We define ^*+<'* in such a way as to increase the Pg-weight of points where / 
is small, while not going too far from Pgt in Kullback-Leibler divergence. We 
use the adaptive weights W^t as a way to measure which points have large 
or small values. In accordance with (4), this suggests taking the gradient 
ascent 

gt+st ^Qt^g^y^J ^/ 

where the natural gradient is suggested by Proposition 1. 

Note again that we use Wgt and not Wg in the integral. So the gradient 
Vg does not act on the adaptive objective Wgt- If we used Wg instead, we 
would face a paradox: right after a move, previously good points do not 
seem so good any more since the distribution has improved. More precisely, 

f 1 

J Wg (x) Pg{dx) is constant and always equal to the average weight /q w, 
and so the gradient would always vanish. 

Using the log-likelihood trick VPg = Pg VlnPg (assuming Pg is smooth), 
we get an equivalent expression of the update above as an integral under 
the current distribution Pgt; this is important for practical implementation. 
This leads to the following definition. 
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Definition 2 (IGO flow). The IGO flow is the set of continuous-time tra- 
jectories in space @, defined by the differential equation 

_ = V,y T^/.(x)P,(dx) (6) 

= j (x) Ve In Pe (x) Pgt (dx) (7) 

= rHo^)JwUx)'-^^p,4dx). (8) 

where the gradients are taken at point 9 = 6'', and I is the Fisher information 
matrix. 

Natural evolution strategies (NES, [WSPS08, GSS+10, SWSS09]) feature 
a related gradient descent with f{x) instead of Wgtix). The associated flow 
would read 

^ = -V,y f{x)Pe{dx) , (9) 

where the gradient is taken at 0* (in the sequel when not explicitly stated, 
gradients in 6 are taken at 9 = 9'). However, in the end NESs always 
implement algorithms using sample quantiles, as if derived from the gradient 
ascent of W^ti^x). 

The update (7) is a weighted average of "intrinsic moves" increasing the 
log-likelihood of some points. We can slightly rearrange the update as 

prefe rence wei ght current sample distribution 

^ = 1 W^,{x) Ve\^Pe{x) P^^ (10) 

intrinsic move to reinforce x 

= VeJ VF/,(x)lnPg(x) Pe*(dx). (11) 

weighted log-likelihood 

which provides an interpretation for the IGO gradient flow as a gradient 
ascent optimization of the weighted log-likelihood of the "good points" of 
the current distribution. In a precise sense, IGO is in fact the "best" way 
to increase this log- likelihood (Theorem 13). 

For exponential families of probability distributions, we will see later 
that the IGO flow rewrites as a nice derivative- free expression (18). 

The IGO algorithm. Time discretization and sampling. The above 
is a mathematically well-defined continuous-time flow in the parameter space. 
Its practical implementation involves three approximations depending on 
two parameters N and 5t: 

• the integral under Pgt is approximated using A'^ samples taken from 

• the value Wgt is approximated for each sample taken from Pgt; 

• the time derivative ^ is approximated by a 5t time increment. 

We also assume that the Fisher information matrix 1(9) and 
can be computed (see discussion below if 1(9) is unkown). 

At each step, we pick N samples xi, . . . ,X]\f under Pgt. To approximate 
the quantiles, we rank the samples according to the value of /. Define 
rk(xi) = f{xj) < f{xi)} and let the estimated weight of sample Xi be 

^ 1 /rk(xi) + l/2\ 

Wi = ^w{ ' (12) 



iV V N 
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using the weighting scheme function w introduced above. (This is assuming 
there are no ties in our sample; in case several sample points have the same 
value of /, we define Wi by averaging the above over all possible rankings of 
the ties^.) 

Then we can approximate the IGO flow as follows. 

Definition 3 (IGO algorithm). The IGO algorithm with sample size N and 
step size 6t is the following update rule for the parameter 9^. At each step, 
N sample points xi, . . . ,xn are picked according to the distribution Pgt . The 
parameter is updated according to 



N 

Qt+5t ^Qt^^^Y^^, VelnPe(3;i) (13) 

2 = 1 

N 



e' + 5tr\e') ^ 



Wi 



0=e* 
din Pg{xi) 



. , 89 

1=1 



(14) 

6»=6l* 



where Wi is the weight (12) obtained from the ranked values of the objective 
function f . 

Equivalently one can fix the weights Wi = j^w (j-^/^^ once and for all 
and rewrite the update as 



(15) 

=0t 



where Xi-N denotes the i^^ sampled point ranked according to /, i.e. f{xi:N) < 
. . . < f{x]\f:N) (assuming again there are no ties). Note that {xi-j\f} = {xi} 
and {wi} = {wi}. 

As will be discussed in Section 4, this update applied to multivariate nor- 
mal distributions or Bernoulli measures allows to neatly recover versions of 
some well-established algorithms, in particular CMA-ES and PBIL. Actually, 
in the Gaussian context updates of the form (14) have already been intro- 
duced [GSS+10, ANOKIO], though not formally derived from a continuous- 
time flow with quantiles. 

When — 7- oo, the IGO algorithm using samples approximates the 
continuous-time IGO gradient flow, see Theorem 4 below. Indeed, the IGO 
algorithm, with N = oo, is simply the Euler approximation scheme for the 
ordinary differential equation defining the IGO flow (6). The latter result 
thus provides a sound mathematical basis for currently used rank-based 
updates. 

IGO flow versus IGO algorithms. The IGO flow (6) is a well-defined 
continuous-time set of trajectories in the space of probability distributions 
Pq, depending only on the objective function / and the chosen family of dis- 
tributions. It does not depend on the chosen parametrization for 9 (Propo- 
sition 8). 

On the other hand, there are several IGO algorithms associated with this 
flow. Each IGO algorithm approximates the IGO flow in a slightly different 



mathematically neater but less intuitive version would be 

ru=ik+{xi)/N 



1 



■w(u)du 



with rk-(a:0 = fix,) < fix,)} and rk+ix,) = fix,) ^ fix,)}. 
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way. An IGO algorithm depends on three further choices: a sample size A^, 
a time discretization step size 6t, and a choice of parametrization for 9 in 
which to implement (14). 

If 6t is small enough, and N large enough, the influence of the parametriza- 
tion 9 disappears and all IGO algorithms are approximations of the "ideal" 
IGO flow trajectory. However, the larger 6t, the poorer the approximation 
gets. 

So for large 5t, different IGO algorithms for the same IGO flow may 
exhibit different behaviors. We will see an instance of this phenomenon for 
Gaussian distributions: both CMA-ES and the maximum likelihood update 
(EMNA) can be seen as IGO algorithms, but the latter with 5t = 1 is known 
to exhibit premature loss of diversity (Section 4.2). 

Still, two IGO algorithms for the same IGO flow will differ less from 
each other than from a non-IGO algorithm: at each step the difference is 
only 0{6t'^) (Section 2.4). On the other hand, for instance, the difference 
between an IGO algorithm and the vanilla gradient ascent is, generally, not 
smaller than 0{5t) at each step, i.e. roughly as big as the steps themselves. 

Unknown Fisher matrix. The algorithm presented so far assumes that 
the Fisher matrix I{9) is known as a function of 9. This is the case for 
Gaussian distributions in CMA-ES and for Bernoulli distributions. How- 
ever, for restricted Boltzmann machines as considered below, no analytical 
form is known. Yet, provided the quantity ■^hiP0{x) can be computed or 
approximated, it is possible to approximate the integral 



using Pg-Monte Carlo samples for x. These samples may or may not be the 
same as those used in the IGO update (14): in particular, it is possible to 
use as many Monte Carlo samples as necessary to approximate Ijj, at no 
additional cost in terms of the number of calls to the black-box function / 
to optimize. 

Note that each Monte Carlo sample x will contribute ^^"qq^^^ ^^^m-^^^ 
the Fisher matrix approximation. This is a rank-1 matrix. So, for the ap- 
proximated Fisher matrix to be invertible, the number of (distinct) samples 
X needs to be at least equal to the number of components of the parameter 
9 i.e. Apisher ^ dim 6. 

For exponential families of distributions, the IGO update has a particular 
form (18) which simplifies this matter somewhat. More details are given 
below (see Section 5) for the concrete situation of restricted Boltzmann 
machines. 

2 First properties of IGO 
2.1 Consistency of sampling 

The first property to check is that when N — >■ oo, the update rule using N 
samples converges to the IGO update rule. This is not a straightforward 
application of the law of large numbers, because the estimated weights idi 
depend (non-continuously) on the whole sample xi, . . . ,xn, and not only on 
x^. 
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Theorem 4 (Consistency). When N oo, the N -sample IGO update rule 
(14): 



converges with probability 1 to i/ie update rule (5); 

The proof is given in the Appendix, under mild regularity assumptions. 
In particular we do not require that w be continuous. 

This theorem may clarify previous claims [WSPS08, ANOKIO] where 

rank-based updates similar to (5), such as in NES or CMA-ES, were derived 

from optimizing the expected value — Ep^/. The rank-based weights Wi were 

then introduced somewhat arbitrarily. Theorem 4 shows that, for large A'^, 

J? 

CMA-ES and NES actually follow the gradient flow of the quantity E Wgt '■ 
the update can be rigorously derived from optimizing the expected value of 
the quantile-rewriting W^. 

2.2 Monotonicity: quantile improvement 

Gradient descents come with a guarantee that the fitness value decreases 
over time. Here, since we work with probability distributions on A, we 
need to define the fitness of the distribution Pgt. An obvious choice is 
the expectation Kp^^f, but it is not invariant under /-transformation and 
moreover may be sensitive to extreme values. 

It turns out that the monotonicity properties of the IGO gradient fiow 
depend on the choice of the weighting scheme w. For instance, if w{u) = 
1m^i/2) then the median of / improves over time. 

Proposition 5 (Quantile improvement). Consider the IGO flow given by (6), 
with the weight w{u) = lu^q where < q < 1 is fixed. Then the value of 
the q-quantile of f improves over time: if ti ^ t2 then either 9^^ = 6^^ or 

Here the q-quantile Qpif) of f under a probability distribution P is 
defined as any number m such thatVix'^pif {x) ^ m) ^ q andVix'^p{f{x) ^ 
m) ^ 1 — q. 

The proof is given in the Appendix, together with the necessary regular- 
ity assumptions. 

Of course this holds only for the IGO gradient flow (6) with N = oo and 
5t — >■ 0. For an IGO algorithm with flnite N, the dynamics is random and 
one cannot expect monotonicity. Still, Theorem 4 ensures that, with high 
probability, trajectories of a large enough finite population dynamics stay 
close to the infinite-population limit trajectory. 

2.3 The IGO flow for exponential families 

The expressions for the IGO update simplify somewhat if the family Pg 
happens to be an exponential family of probability distributions (see also 
[MMS08]). Suppose that Pg can be written as 

Peix) = ^exp (5]eir,(x)) H{dx) 
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where Ti, . . . , T^, is a finite family of functions on X, H{dx) is an arbitrary 
reference measure on X, and Z{6) is the normahzation constant. It is well- 
known [ANOO, (2.33)] that 

—^^ = T,ix)-Ep,T, (16) 

so that [ANOO, (3.59)] 

Iij{e) = Covp,{T„Tj). (17) 

In the end we find: 

Proposition 6. Let Pg be an exponential family parametrized by the natural 
parameters 9 as above. Then the IGO flow is given by 

- = Covp,(r,T)-i Cov p,{T,wi) (18) 

where Covpg(T, w/) denotes the vector (Covp^ (Tj, w/))i, and Covpg{T,T) 
the matrix {Cov Pg {Ti, Tj))ij . 

Note that the right-hand side does not involve derivatives w.r.t. 9 any 
more. This result makes it easy to simulate the IGO flow using e.g. a Gibbs 
sampler for Pg: both covariances in (18) may be approximated by sampling, 
so that neither the Fisher matrix nor the gradient term need to be known 
in advance, and no derivatives are involved. 

The CMA-ES uses the family of all Gaussian distributions on M.'^. Then, 
the family Tj is the family of all linear and quadratic functions of the coor- 
dinates on M*^. The expression above is then a particularly concise rewriting 
of a CMA-ES update, see also Section 4.2. 

Moreover, the expected values Tj = ET, of Tj satisfy the simple evolution 
equation under the IGO flow 

1 rp 

= Cov{Ti , Wi) = E{Ti M^/) - fi EWgf. (19) 

The proof is given in the Appendix, in the proof of Theorem 15. 

The variables Tj can sometimes be used as an alternative parametrization 
for an exponential family (e.g. for a one-dimensional Gaussian, these are 
the mean fj, and the second moment /i^ + cj^). Then the IGO flow (7) 
may be rewritten using the relation Vg. = for the natural gradient of 

exponential families (Appendix, Proposition 22), which sometimes results 
in simpler expressions. We shall further exploit this fact in Section 3. 

Exponential families with latent variables. Similar formulas hold 
when the distribution Pg{x) is the marginal of an exponential distribution 
Pg{x,h) over a "hidden" or "latent" variable h, such as the restricted Boltz- 
mann machines of Section 5. 

Namely, with Pg{x) = 6xp(X]i ^j7i(a^, -^(^^j d/i) the Fisher 

matrix is 

Ii,{9) = CoYp,{Ui,Uj) (20) 
where Ui{x) = Epg{Ti{x, h)\x). Consequently, the IGO flow takes the form 
c\9 

- = Covp,(C/,C/)-i Covp,{U,wi). (21) 
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2.4 Invariance properties 

Here we formally state the invariance properties of the IGO flow under vari- 
ous reparametrizations. Since these results follow from the very construction 
of the algorithm, the proofs are omitted. 

Proposition 7 (/-invariance). Let ip : M ^ M be an increasing function. 
Then the trajectories of the IGO flow when optimizing the functions f and 
(p{f) are the same. 

The same is true for the discretized algorithm with population size N 
and step size 5t > 0. 

Proposition 8 (0-invariance). Let 9' = ip{0) be a one-to-one function of 
9 and let Pg, = P^-i(^gy Let 9^ be the trajectory of the IGO flow when 
optimizing a function f using the distributions Pq, initialized at 9^. Then 
the IGO flow trajectory {9'Y obtained from the optimization of the function f 
using the distributions Pq,, initialized at {9')^ = ^{9^), is the same, namely 
[ej = ip{9'). 

For the algorithm with finite and 5t > 0, invariance under ^-reparame- 
trization is only true approximately, in the limit when 5t — ^ 0. As mentioned 
above, the IGO update (14), with = oo, is simply the Euler approximation 
scheme for the ordinary differential equation (6) defining the IGO flow. At 
each step, the Euler scheme is known to make an error 0(5t'^) with respect 
to the true flow. This error actually depends on the parametrization of 9. 

So the IGO updates for different parametrizations coincide at first order 
in 5t, and may, in general, differ by 0{6t'^). For instance the difference 
between the CMA-ES and xNES updates is indeed 0{6t'^), see Section 4.2. 

For comparison, using the vanilla gradient results in a divergence oiO{6t) 
at each step between different parametrizations. So the divergence could be 
of the same magnitude as the steps themselves. 

In that sense, one can say that IGO algorithms are "more parametrization- 
invariant" than other algorithms. This stems from their origin as a discretiza- 
tion of the IGO flow. 

The next proposition states that, for example, if one uses a family of 
distributions on M*^ which is invariant under affine transformations, then 
our algorithm optimizes equally well a function and its image under any 
affine transformation (up to an obvious change in the initialization). This 
proposition generalizes the well-known corresponding property of CMA-ES 
[HOOl]. 

Here, as usual, the image of a probability distribution P by a transfor- 
mation (p is defined as the probability distribution P' such that P'{Y) = 
P{(p~^{Y)) for any subset Y C X. In the continuous domain, the density of 
the new distribution P' is obtained by the usual change of variable formula 
involving the Jacobian of (p. 

Proposition 9 (X-invariance) . Let ip : X ^ X be a one-to-one transfor- 
mation of the search space, and assume that ip> globally preserves the family 
of measures Pq. Let 9^ be the IGO flow trajectory for the optimization of 
function f , initialized at Pgo. Let {9'Y be the IGO flow trajectory for opti- 
mization of f o ^p^^ , initialized at the image of Pqo by (p. Then P^e'y is the 
image of Pgt by ip. 

For the discretized algorithm with population size N and step size 5t > 0, 
the same is true up to an error ofO{6t'^) per iteration. This error disappears 
if the map (p acts on Q in an affine way. 
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The latter case of afHne transforms is well exemplified by CMA-ES: here, 
using the variance and mean as the parametrization of Gaussians, the new 
mean and variance after an afhne transform of the search space arc an affine 
function of the old mean and variance; specifically, for the affine transforma- 
tion A : X Ax + b we have (m, C) i-)- {Am + b, ACA^). 

2.5 Speed of the IGO flow 

Proposition 10. The speed of the IGO flow, i.e. the norm of ^ in the 
Fisher metric, is at most ^ — (/q^ iw)^ where w is the weighting scheme. 

This speed can be tested in practice in at least two ways. The first is 
just to compute the Fisher norm of the increment ^*+^* — 0^ using the Fisher 
matrix; for small 5t this is close to with || • || the Fisher metric. The 

second is as follows: since the Fisher metric coincides with the Kullback- 
Leibler divergence up to a factor 1/2, we have KL{PQt+st \\ Pet) ~ ^<5i^||^|P 
at least for small St. Since it is relatively easy to estimate KL(Pgt+a || Pgt) 
by comparing the new and old log-likelihoods of points in a Monte Carlo 
sample, one can obtain an estimate of ||^||. 

Corollary 11. Consider an IGO algorithm with weighting scheme w, step 
size 6t and sample size N. Then, for small 6t and large N we have 

KL{Pet+6t II Pet) ^ ^St^ Var[o,i] w + 0{St^) + 0{1/Vn). 

For instance, with w{q) = Iq-^q^ and neglecting the error terms, an IGO 
algorithm introduces at most ^St^ qo(l — qo) bits of information (in base e) 
per iteration into the probability distribution Pg. 

Thus, the time discretization parameter St is not just an arbitrary vari- 
able: it has an intrinsic interpretation related to a number of bits introduced 
at each step of the algorithm. This kind of relationship suggests, more gen- 
erally, to use the KuUback-Leibler divergence as an external and objective 
way to measure learning rates in those optimization algorithms which use 
probability distributions. 

The result above is only an upper bound. Maximal speed can be achieved 
only if all "good" points point in the same direction. If the various good 
points in the sample suggest moves in inconsistent directions, then the IGO 
update will be much smaller. The latter may be a sign that the signal is 
noisy, or that the family of distributions Pe is not well suited to the problem 
at hand and should be enriched. 

As an example, using a family of Gaussian distributions with unkown 
mean and fixed identity variance on W^, one checks that for the optimization 
of a linear function on M^, with the weight w{u) = —tu>i/2 + 1m<i/2) the 
IGO flow moves at constant speed l/-\/27r 0.4, whatever the dimension 
d. On a rapidly varying sinusoidal function, the moving speed will be much 
slower because there are "good" and "bad" points in all directions. 

This may suggest ways to design the weighting scheme w to achieve max- 
imal speed in some instances. Indeed, looking at the proof of the proposition, 
which involves a Cauchy-Schwarz inequality, one can see that the maximal 
speed is achieved only if there is a linear relationship between the weights 
Wg {x) and the gradient lnPe{x). For instance, for the optimization of a 
linear function on R'^ using Gaussian measures of known variance, the max- 
imal speed will be achieved when the weighting scheme w{u) is the inverse 
of the Gaussian cumulative distribution function. (In particular, w{u) tends 
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to +00 when n — )■ and to —00 when u ^ 1.) This is in accordance with 
previously known results: the expected value of the i-th order statistic of N 
standard Gaussian variates is the optimal idi value in evolution strategies 
[BcyOl, Arn06]. For N ^ 00, this order statistic converges to the inverse 
Gaussian cumulative distribution function. 



2.6 Noisy objective function 

Suppose that the objective function / is non-deterministic: each time we 
ask for the value of / at a point x € X, we get a random result. In this 
setting we may write the random value f(x) as f{x) = f(x,uj) where to is 
an unseen random parameter, and / is a deterministic function of x and oj. 
Without loss of generality, up to a change of variables we can assume that 
u is uniformly distributed in [0, 1]. 

We can still use the IGO algorithm without modification in this con- 
text. One might wonder which properties (consistency of sampling, etc.) 
still apply when / is not deterministic. Actually, IGO algorithms for noisy 
functions fit very nicely into the IGO framework: the following proposition 
allows to transfer any property of IGO to the case of noisy functions. 

Proposition 12 (Noisy IGO). Let f be a random function of x G X, 
namely, f{x) = f{x,uj) where u is a random variable uniformly distributed 
in [0, 1], and f is a deterministic function of x and ui. Then the two follow- 
ing algorithms coincide: 

• The IGO algorithm (13), using a family of distributions Pq on space 
X, applied to the noisy function f , and where the samples are ranked 
according to the random observed value of f (here we assume that, for 
each sample, the noise oj is independent from everything else); 

• The IGO algorithm on space X x [0, 1], using the family of distributions 
Pg = Pq <Si t^[o,i]7 applied to the deterministic function f. Here ?7[o,i] 
denotes the uniform law on [0, 1]. 

The (easy) proof is given in the Appendix. 

This proposition states that noisy optimization is the same as ordinary 
optimization using a family of distributions which cannot operate any selec- 
tion or convergence over the parameter cv. More generally, any component of 
the search space in which a distribution-based evolutionary strategy cannot 
perform selection or specialization will effectively act as a random noise on 
the objective function. 

As a consequence of this result, all properties of IGO can be transferred 
to the noisy case. Consider, for instance, consistency of sampling (Theo- 
rem 4). The A^-sample IGO update rule for the noisy case is identical to the 
non- noisy case (14): 



ad 



where each weight Wi computed from (12) now incorporates noise from the 
objective function because the rank of Xi is computed on the random func- 
tion, or equivalently on the deterministic function /: rk(xj) = /(xj, cjj) < 

/(Xi,Wi)}. 

Consistency of sampling (Theorem 4) thus takes the following form: 
When N ^ 00, the A^-sample IGO update rule on the noisy function / 
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converges with probability 1 to the update rule 




(22) 



where Wjj (x) = E^W^ (x,a;). This entails, in particular, that when N ^ oo, 
the noise disappears asymptotically, as could be expected. 

Consequently, the IGO flow in the noisy case should be defined by the 
5t — >■ limit of the update (22) using W. Note that the quantiles q^{x) de- 
fined by (2) still make sense in the noisy case, and are deterministic functions 
of x; thus Wg (x) can also be defined by (3) and is deterministic. However, 
unless the weighting scheme w{q) is affine, Wq (x) is different from Wg (x) 
in general. Thus, unless w is affine the fiows defined by W and W do not 
coincide in the noisy case. The fiow using W would be the N ^ oo limit 
of a slightly more complex algorithm using several evaluations of / for each 
sample Xi in order to compute noise- free ranks. 

2.7 Implementation remarks 

Influence of the weighting scheme w. The weighting scheme w directly 
affects the update rule (15). 

A natural choice is w^u) = l^^g. This, as we have proved, results in 
an improvement of the g-quantile over the course of optimization. Taking 
q = 1/2 springs to mind; however, this is not selective enough, and both 
theory and experiments confirm that for the Gaussian case (CMA-ES), most 
efficient optimization requires q < 1/2 (see Section 4.2). The optimal q is 
about 0.27 if N is not larger than the search space dimension d [BcyOl] and 
even smaller otherwise. 

Second, replacing w with w+c for some constant c clearly has no infiuence 
on the IGO continuous-time flow (5), since the gradient will cancel out the 
constant. However, this is not the case for the update rule (15) with a flnite 
sample of size A^. 

Indeed, adding a constant c to w adds a quantity c-^ In ^^(xj) to 
the update. Since we know that the Pg-expected value of In Pg is 
(because J{VelnPe)Pe = J^Pe = VI = 0), we have Ej^VelnPeixi) = 0. 
So adding a constant to w does not change the expected value of the update, 
but it may change e.g. its variance. The empirical average of V6ilnPe(xj) 
in the sample will be 0{1/^/N). So translating the weights results in a 
0(l/\/iV) change in the update. See also Section 4 in [SWSS09]. 

Determining an optimal value for c to reduce the variance of the update is 
difficult, though: the optimal value actually depends on possible correlations 
between VglnPg and the function /. The only general result is that one 
should shift w so that lies within its range. Assuming independence, or 
dependence with enough symmetry, the optimal shift is when the weights 
average to 0. 

Adaptive learning rate. Comparing consecutive updates to evaluate a 
learning rate or step size is an effective measure. For example, in back- 
propagation, the update sign has been used to adapt the learning rate of 
each single weight in an artificial neural network [SA90]. In CMA-ES, a 
step size is adapted depending on whether recent steps tended to move in 
a consistent direction or to backtrack. This is measured by considering 
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the changes of the mean m of the Gaussian distribution. For a probabihty 
distribution Pq on an arbitrary search space, in general no notion of mean 
may be defined. However, it is still possible to define "backtracking" in the 
evolution of 9 as follows. 

Consider two successive updates 69^ = 0* — and 59^^^ = 6'*+'^* — 0*. 
Their scalar product in the Fisher metric I{9*) is 

{59\ 59'+^') = J2 hj{9') S9l S9^+^K 

Dividing by the associated norms will yield the cosine cosq of the angle 
between 59^ and 69^+^. 

If this cosine is positive, the learning rate St may be increased. If the 
cosine is negative, the learning rate probably needs to be decreased. Various 
schemes for the change of 6t can be devised; for instance, inspired by CMA- 
ES, one can multiply 5t by exp(/3(cos q)/2) or exp(/3(lcosa>o - 1cosq<o)/2), 
where /? « min(A^/ dimO, 1/2). 

As before, this scheme is constructed to be robust w.r.t. reparametriza- 
tion of 9, thanks to the use of the Fisher metric. However, for large learning 
rates 5t, in practice the parametrization might well become relevant. 

A consistent direction of the updates does not necessarily mean that the 
algorithm is performing well: for instance, when CEM/EMNA exhibits pre- 
mature convergence (see below), the parameters consistently move towards 
a zero covariance matrix and the cosines above are positive. 

Complexity. The complexity of the IGO algorithm depends much on the 
computational cost model. In optimization, it is fairly common to assume 
that the objective function / is very costly compared to any other calcula- 
tions performed by the algorithm. Then the cost of IGO in terms of number 
of /-calls is per iteration, and the cost of using quantiles and computing 
the natural gradient is negligible. 

Setting the cost of / aside, the complexity of the IGO algorithm depends 
mainly on the computation of the (inverse) Fisher matrix. Assume an an- 
alytical expression for this matrix is known. Then, with p = dimG the 
number of parameters, the cost of storage of the Fisher matrix is 0{p'^) per 
iteration, and its inversion typically costs 0{p^) per iteration. However, de- 
pending on the situation and on possible algebraic simplifications, strategies 
exist to reduce this cost (e.g. [LRMB07] in a learning context). For instance, 
for CMA-ES the cost is 0{Np) [SHI09]. More generally, parametrization by 
expectation parameters (see above), when available, may reduce the cost to 
0{p) as well. 

If no analytical form of the Fisher matrix is known and Monte Carlo es- 
timation is required, then complexity depends on the particular situation at 
hand and is related to the best sampling strategies available for a particular 
family of distributions. For Boltzmann machines, for instance, a host of such 
strategies are available. Still, in such a situation, IGO can be competitive if 
the objective function / is costly. 

Recycling samples. We might use samples not only from the last iter- 
ation to compute the ranks in (12), see e.g. [SWSS09]. For = 1 this 
is indispensable. In order to preserve sampling consistency (Theorem 4) 
the old samples need to be reweighted (using the ratio of their new vs old 
likelihood, as in importance sampling). 
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Initialization. As with other optimization algorithms, it is probably a 
good idea to initialize in such a way as to cover a wide portion of the search 
space, i.e. 6^ should be chosen so that Pqo has maximal diversity. For IGO 
algorithms this is particularly relevant, since, as explained above, the natural 
gradient provides minimal change of diversity (greedily at each step) for a 
given change in the objective function. 

3 IGO, maximum likelihood and the cross-entropy 
method 

IGO as a smooth-time maximum likelihood estimate. The IGO 

flow turns out to be the only way to maximize a weighted log-likelihood, 
where points of the current distribution are slightly reweighted according to 
/-preferences. 

This relies on the following interpretation of the natural gradient as a 
weighted maximum likelihood update with infinitesimal learning rate. This 
result singles out, in yet another way, the natural gradient among all possible 
gradients. The proof is given in the Appendix. 

Theorem 13 (Natural gradient as ML with infinitesimal weights). Let e > 
and Oq G O. Let W{x) be a function of x and let 6 be the solution of 

e = argmax|(l-e) J \ogPe{x) Pe^idx) + e J log Pg{x) W (x) Peo{dx)^. 

maximal for 9 = 0o 

Then, when e — )■ 0, up to O(e^) we have 

9 = eo + el Vein Pg{x) W{x)Pe,{dx). 
Likewise for discrete samples: with xi, . . . ,xn €z X , let 6 be the solution of 

6* = argmax |(1 - e) J log Pe{x) Peo{dx) + eY^W{xi) log Pe{xi)^ . 

Then when e — >■ 0, up to O(e^) we have 

e = 0o + £Y,W{xi) Vein Pe{xi). 

i 

So if W{x) = Wq^{x) is the weight of the points according to quantilized 
/-preferences, the weighted maximum log-likelihood necessarily is the IGO 
flow (7) using the natural gradient — or the IGO update (14) when using 
samples. 

Thus the IGO flow is the unique flow that, continuously in time, slightly 
changes the distribution to maximize the log-likelihood of points with good 
values of /. Moreover IGO continuously updates the weight Wg^{x) depend- 
ing on / and on the current distribution, so that we keep optimizing. 

This theorem suggests a way to approximate the IGO flow by enforcing 
this interpretation for a given non-infinitesimal step size 5t, as follows. 

Definition 14 (IGO-ML algorithm). The IGO-ML algorithm with step size 
5t updates the value of the parameter 9^ according to 

=argmax|(l-5ti:^5i) / log Pe{x) Pet {dx) + 5tY,WilogPe{xi)^ 

(23) 
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where xi, . . . , xi\[ are sample points picked according to the distribution PQt, 
and Wi is the weight (12) obtained from the ranked values of the objective 
function f . 

As for the cross-entropy method below, this only makes algorithmic sense 
if the argmax is tractable. 

It turns out that IGO-ML is just the IGO algorithm in a particular 
parametrization (see Theorem 15). 

The cross-entropy method. Taking 5t = 1 in (23) above corresponds 
to a full maximum likelihood update, which is also related to the cross- 
entropy method (GEM). The cross-entropy method can be defined as follows 
[dBKMR05] in an optimization setting. Like IGO, it depends on a family 
of probability distributions Pg parametrized hy 6 € @, and a number of 
samples N at each iteration. Let also Ne = \qN~\ (0 < g < 1) be a number 
of elite samples. 

At each step, the cross-entropy method for optimization samples A'^ 
points xi, . . . ,xn from the current distribution Pgt. Let Wi be 1/Ne if Xi be- 
longs to the Ne samples with the best value of the objective function /, and 
Wi = otherwise. Then the cross- entropy method or maximum likelihoood 
update (CEM/ML) for optimization is 

= argmax y^iDj log P6i(a^j) (24) 
e 

(assuming the argmax is tractable). 

A version with a smoother update depends on a step size parameter 
< a ^ 1 and is given [dBKMROS] by 

6»*+i = (1 - a)9^ + a argmaxy^ i^^\ogPe{xi). (25) 

e 

The standard GEM/ML update corresponds to a = 1. 

For a = 1 the standard cross-entropy method is independent of the 
parametrization 6, whereas for a < 1 this is not the case. 

Note the difference between the IGO-ML algorithm (23) and the smoothed 
GEM update (25) with step size a = 6t: the smoothed GEM update per- 
forms a weighted average of the parameter value after taking the maximum 
likelihood estimate, whereas IGO-ML uses a weighted average of current 
and previous likelihoods, then takes a maximum likelihood estimate. In gen- 
eral, these two rules can greatly differ, as they do for Gaussian distributions 
(Section 4.2). 

This interversion of averaging makes IGO-ML parametrization-independent 
whereas the smoothed GEM update is not. 

Yet, for exponential families of probability distributions, there exists one 
particular parametrization 9 in which the IGO algorithm and the smoothed 
GEM update coincide. We now proceed to this construction. 

IGO for expectation parameters and maximum likelihood. The 

particular form of IGO for exponential families has an interesting conse- 
quence if the parametrization chosen for the exponential family is the set of 
expectation parameters. Let Pe{x) = exp {J2 0jTj{x)) H{dx) be an expo- 
nential family as above. The expectation parameters are Tj = Tj[6) = Mp^Tj, 
(denoted rjj in [ANOO, (3.56)]). The notation T will denote the collection 
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It is well-known that, in this parametrization, the maximum likelihood 
estimate for a sample of points xi, . . . ,Xk is just the empirical average of 
the expectation parameters over that sample: 

argmaxi^logPj.(xi) = i^T(xi). (26) 

f i=i «^ i=i 

In the discussion above, one main difference between IGO and smoothed 
CEM was whether we took averages before or after taking the maximum 
log- likelihood estimate. For the expectation parameters Tj, we see that 
these operations commute. (One can say that these expectation parameters 
"linearize maximum likelihood estimates".) A little work brings us to the 

Theorem 15 (IGO, CEM and maximum likelihood). Let 

Pe{x) = ^exp {j2(^,T,{x)) if(dx) 

be an exponential family of probability distributions, where the Tj are func- 
tions of X and H is some reference measure. Let us parametrize this family 
by the expected values Tj = ET,-. 

Let us assume the chosen weights wi sum to 1. For a sample xi, . . . ,xn, 

let 

T* = ^WiTj{xi). 

i 

Then the IGO update (14) in this parametrization reads 

rj^t+5t ft ^ J.* _ (27) 

Moreover these three algorithms coincide: 

• The IGO-ML algorithm (23). 

• The IGO algorithm written in the parametrization Tj. 

• The smoothed CEM algorithm (25) written in the parametrization Tj, 
with a = 6t. 

Corollary 16. The standard CEM/ML update (24) is the IGO algorithm 
in parametrization Tj with 5t = 1. 

Beware that the expectation parameters Tj are not always the most 
obvious parameters [ANOO, Section 3.5]. For example, for 1-dimensional 
Gaussian distributions, these expectation parameters are the mean // and 
the second moment fj? + a"^. When expressed back in terms of mean and 
variance, with the update (27) the new mean is (1 — 5t)^ + Stfi* , but the 
new variance is (1 - 5t)a'^ + 5t{a*f + 6t{l - 6t){n* - ^f. 

On the other hand, when using smoothed CEM with mean and variance 
as parameters, the new variance is (1 — 5t)a'^ + (5t(cr*)^, which can be signif- 
icantly smaller for 8t G (0, 1). This proves, in passing, that the smoothed 
CEM update in other parametrizations is generally not an IGO algorithm 
(because it can differ at first order in 5t). 

The case of Gaussian distributions is further exemplified in Section 4.2 
below: in particular, smoothed CEM in the (^, a) parametrization exhibits 
premature reduction of variance, preventing good convergence. 

For these reasons we think that the IGO-ML algorithm is the sensible 
way to perform an interpolated ML estimate for 5t < 1, in a parametrization- 
independent way. In Section 6 we further discuss IGO and CEM and sum 
up the differences and relative advantages. 
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Taking 5t = 1 is a bold approximation choice: the "ideal" continuous- 
time IGO flow itself, after time 1, does not coincide with the maximum 
likelihood update of the best points in the sample. Since the maximum 
likelihood algorithm is known to converge prematurely in some instances 
(Section 4.2), using the parametrization by expectation parameters with 
large 6t may not be desirable. 

The considerable simpliflcation of the IGO update in these coordinates 
reflects the duality of coordinates Tj and Oi. More precisely, the natural 
gradient ascent w.r.t. the parameters Tj is given by the vanilla gradient 
w.r.t. the parameters 9i: 

(Proposition 22 in the Appendix). 

4 CMA-ES, NES, EDAs and PBIL from the IGO 
framework 

In this section we investigate the IGO algorithm for Bernoulli measures and 
for multivariate normal distributions and show the correspondence to well- 
known algorithms. In addition, we discuss the influence of the parametriza- 
tion of the distributions. 

4.1 IGO algorithm for Bernoulli measures and PBIL 

We consider on X = {0, l}"* a family of Bernoulli measures Peix) = pei{xi)x 
. . . X po^{xd) with po^{xi) = ^f'(l — 0i)^~^\ As this family is a product of 
probability measures pg-{xi), the different components of a random vector y 
following Pq are independent and all off-diagonal terms of the Fisher informa- 
tion matrix (FIM) are zero. Diagonal terms are given by g-^j^r^-y. Therefore 
the inverse of the FIM is a diagonal matrix with diagonal entries equal to 
9i{l — Oi). In addition, the partial derivative of InPe(x) w.r.t. Oi can be 
computed in a straightforward manner resulting in 

9 In P£i(x) Xi 1 — Xi 
dOi Oi I — Oi 

Let xi, . . . ,xn be N samples at step t with distribution P^t and let 
xi:Ar, . . . ,xj\f:N be the ranked samples according to /. The natural gradient 
update (15) with Bernoulli measures is then given by 

r ^* = oi + 6t 0\{l - Ot) f: - ^^^) (28) 

where Wj = w{{j — 1/2) /N) /N and [y\i denotes the i"^ coordinate of y G X. 
The previous equation simplifles to 

N 

e*+^^ =0\ + 5tY^WJ([x^..N]^-0f) , 

i=i 

or, denoting w the sum of the weights Y^^=i Wj, 

N 

^t+St -^^^ Qt + 5tY, Wj [Xj.,N]i . 
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If we set the IGO weights as wi = 1, wj = for j = 2,...,N, we 
recover the PBIL/EGA algorithm with update rule towards the best solution 
only (disregarding the mutation of the probability vector), with 5t = LR 
where LR is the so-called learning rate of the algorithm [BalO l, Figure 4]. 
The PBIL update rule towards the /j. best solutions, proposed in [BC95, 
Figure 4]^, can be recovered as well using 

St = LR 

Wj = il-LRy-\ fori = l,...,// 
Wj = 0, for j = fi + 1, . . . , N . 

Interestingly, the parameters 9i are the expectation parameters described 
in Section 3: indeed, the expectation of Xi is 6i. So the formulas above are 
particular cases of (27). Thus, by Theorem 15, PBIL is both a smoothed 
GEM in these parameters and an IGO-ML algorithm. 

Let us now consider another, so-called "logit" representation, given by 
the logistic function P{xi = 1) = — — K—tt- This 9 is the exponential 
parametrization of Section 2.3. We find that 

dlnPpi(x) , exp(— ^j) ^ 
= (xi -1) + — — = Xi- Exi 

(cf. (16)) and that the diagonal elements of the Fisher information matrix 
are given by exp(— ^j)/(l -|-exp(— 6*4))^ = Varxj (compare with (17)). So the 
natural gradient update (15) with Bernoulli measures now reads 

TV 

~,t+5t _ Qt + 5^(1 + exp(g;t)) + + exp(-el)) w,[xj..N 

To better compare the update with the previous representation, note 
that 9i = h—TT and thus we can rewrite 



r, TV 



So the direction of the update is the same as before and is given by the 
proportion of bits set to or 1 in the sample, compared to its expected value 
under the current distribution. The magnitude of the update is different 
since the parameter 9 ranges from —00 to -|-oo instead of from to 1. We 
did not find this algorithm in the literature. 

These updates illustrate the influence of setting the sum of weights to 
or not (Section 2.7). If, at some time, the first bit is set to 1 both for a 
majority of good points and for a majority of bad points, then the original 
PBIL will increase the probability of setting the first bit to 1, which is 
counterintuitive. If the weights Wi are chosen to sum to this noise effect 
disappears; otherwise, it disappears only on average. 



^Note that the pseudocode for the algorithm in [BC'!)5, Figure 4] is shghtly erroneous 
since it gives smaller weights to better individuals. The error can be fixed by updating 
the probability in reversed order, looping from NUMBER_OF_VECTDRS_TO_UPDATE_FROM to 
1. This was confirmed by S. Baluja in personal communication. We consider here the 
corrected version of the algorithm. 
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4.2 Multivariate normal distributions (Gaussians) 

Evolution strategies [Rcc73, Sch95, BS02] are black-box optimization al- 
gorithms for the continuous search domain, X C M"^ (for simplicity we 
assume X = in the following). They sample new solutions from a 
multivariate normal distribution. In the context of continuous black-box 
optimization, Natural Evolution Strategies (NES) introduced the idea of 
using a natural gradient update of the distribution parameters [WSPS08, 
SWSS09, GSS+10]. Surprisingly, also the well-known Covariance Matrix 
Adaption Evolution Strategy, CMA-ES [H096, HOOl, HMK03, HKO-l, .JA06], 
turns out to conduct a natural gradient update of distribution parameters 
[ANOKIO, GSS+10]. 

Let X £ W^. As the most prominent example, we use mean vector 
m = Ex and covariance matrix C = E(x — m){x — m)^ = E(xa;'^) — mrn^ to 
parametrize the distribution via 6 = (m, C). The IGO update in (14) or (15) 
depends on the chosen parametrization, but can now be entirely reformu- 
lated without the (inverse) Fisher matrix, compare (18). The complexity of 
the update is linear in the number of parameters (size of ^ = (m, C), where 
(d^ — d)/2 parameters are redundant). We discuss known algorithms that 
implement variants of this update. 

CMA-ES. The CMA-ES implements the equations^ 

N 

771*+^ =m} +ri^^ Wi{xi - m*) (29) 

i=l 

N 

= C7* + r/c^ 7l;,((x, - m'){xi - m'f - C*) (30) 

i=l 

where Wi are the weights based on ranked /-values, see (12) and (14). 

When r\c = rj^, Equations (30) and (29) coincide with the IGO update 
(14) expressed in the parametrization {m,C) [ANOKIO, GSS+10]^. Note, 
however, that the learning rates r/m and rjc take essentially different values 
in CMA-ES, if <C dimG^: this is in deviation from an IGO algorithm. 
(Remark that the Fisher information matrix is block-diagonal in m and C 
[ANOKIO], so that application of the different learning rates and of the 
inverse Fisher matrix commute.) 

Natural evolution strategies. Natural evolution strategies (NES) [WSPS08, 
SWSS09] implement (29) as well, but use a Cholesky decomposition of C as 
parametrization for the update of the variance parameters. The resulting 
update that replaces (30) is neither particularly elegant nor numerically ef- 
ficient. However, the most recent xNES [GSS+10] chooses an "exponential" 
parametrization depending on the current parameters. This leads to an ele- 
gant formulation where the additive update in exponential parametrization 
becomes a multiplicative update for C in (30). With C = AA^, the matrix 

^The CMA-ES implements these equations given the parameter setting ci — and 
Cct = (or do- = oo, see e.g. [liaiiO!)]) that disengages the eflect of both so-caUed evolution 
paths. 

^ In these articles the result has been derived for 6 <^ 6 + ri VeEpgf, see (9), leading to 
f{xi) in place of Wi. No assumptions on / have been used besides that it does not depend 
on 9. By replacing / with W^t, where 6* is fixed, the derivation holds equally well for 

^Specifically, let X] l^^'l ~ 1' then rym = 1 and rjc ~ 1 A ^ wf). 



24 



update reads 

A ^ A X exp (^^Y^^ m{A-'{xi - m){A-\xi - m)f - 1^) j (31) 

where 1^ is the identity matrix. (From (31) we get C A x exp^(. . . ) x Al^ .) 
Remark that in the representation 6 = {A~^m,A~^CA^ ) = (j4~^m,Irf), 
the Fisher information matrix becomes diagonal. 

The update has the advantage over (30) that even negative weights al- 
ways lead to feasible values. By default, rym 7^ r/c in xNES in the same 
circumstances as in CMA-ES (most parameter settings are borrowed from 
CMA-ES), but contrary to CMA-ES the past evolution path is not taken 
into account [GSS+10]. 

When = f/m, xNES is consistent with the IGO flow (6), and im- 
plements a slightly generalized IGO algorithm (14) using a ^-dependent 
parametrization. 



Cross-entropy method and EMNA. Estimation of distribution algo- 
rithms (EDA) and the cross- entropy method (GEM) [Rub99, RK04] estimate 
a new distribution from a censored sample. Generally, the new parameter 
value can be written as 

N 

^maxLL = ^rg max Y,Wi\nPe{xi) (32) 
^ i=i 

— >N~^oo arg max Ep^^ w/j In Pq 

For positive weights, ^maxLL maximizes the weighted log-likelihood oixi . . .x^ 
The argument under the arg max in the RHS of (32) is the negative cross- 
entropy between Pq and the distribution of censored (elitist) samples PgtWgt 
or of A'^ realizations of such samples. The distribution -Pe^iaxLL therefore 
minimal cross-entropy and minimal KL-divergence to the distribution of the 
H best samples.^ As said above, (32) recovers the cross-entropy method 
(GEM) [Rub99, RK04]. 

For Gaussian distributions, equation (32) can be explicitly written in the 
form 

N 

m*~^^ = m* = ^ WiXi (33) 
1=1 
N 

=c* = Y^ Wiixi - m*)ixi - m*f (34) 

i=l 

the empirical mean and variance of the elite sample. The weights Wi are 
equal to l/fi for the fi best points and otherwise. 

Equations (33) and (34) also define the most fundamental continuous 
domain EDA, the estimation of multivariate normal algorithm (EMNAgiobab 
[LL02]). It might be interesting to notice that (33) and (34) only differ from 
(29) and (30) in that the new mean m*"*"^ is used in the covariance matrix 
update. 

®Let Piu denote the distribution of the weighted samples: Pr(x — Xi) = Wi and 
y^ ^- Wi — 1. Then the cross-entropy between and Pg reads Pm{xi) In 1/Pe{xi) and 
the KL-divergence reads KL{P^ \\ Pe) = J^i P^i^i)^^^/Pe{xi) - P„(a;i) In 1/Pii,(a;i). 
Minimization of both terms in 9 result in SmaxLL- 
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Let us compare IGO-ML (23), CMA (29)-(30), and smoothed GEM (25) 
in the parametrization with mean and covariance matrix. For learning rate 
6t = l, IGO-ML and smoothed GEM/EMNA reahze (9maxLL from (32)-(34). 
For 5t < 1 these algorithms all update the distribution mean in the same 
way; the update of mean and covariance matrix is computed to be 

m*+^ = (1 - St) m* + 5t m* 

C*+i = (1 - 6t) & + 5tC* + 5t{l - 5ty {m* - m*)(m* - m*)^, 

for different values of j, where m* and C* are the mean and covariance 
matrix computed over the elite sample (with weights Wi) as above. For 
GMA we have j = 0, for IGO-ML we have j = 1, and for smoothed GEM 
we have j = oo (the rightmost term is absent). The case j = 2 corresponds 
to an update that uses m*"*"^ instead of m* in (30) (compare [HanOGl )]). For 
< 5t < 1, the larger j, the smaller C*"*"^. 

The rightmost term of (35) resembles the so-called rank-one update in 
CMA. 

For 5t —7- 0, the update is independent of j at first order in 5t if j < oo: 
this reflects compatibility with the IGO flow of GMA and of IGO-ML, but 
not of smoothed GEM. 

Critical 5t. Let us assume that the weights Wi are non-negative, sum to 
one, and n < N weights take a value of so that the selection quantile 
is g = n/N. 

There is a critical value of 6t depending on this quantile q, such that 
above the critical St the algorithms given by IGO-ML and smoothed GEM 
are prone to premature convergence. Indeed, let / be a linear function on 
M'^, and consider the variance in the direction of the gradient of /. Assuming 
further N ^ oo and q < 1/2, then the variance C* of the elite sample is 
smaller than the current variance C*, by a constant factor. Depending on 
the precise update for C*"*"^, if St is too large, the variance IS gomg 

to be smaller than C* by a constant factor as well. This implies that the 
algorithm is going to stall. (On the other hand, the continuous-time IGO 
flow corresponding to St ^ does not stall, see Section 4.3.) 

We now study the critical St under which the algorithm does not stall. 
For IGO-ML, (j = 1 in (35), or equivalently for the smoothed GEM in the 
expectation parameters (m, C +171771^), see Section 3), the variance increases 
if and only if St is smaller than the critical value Stent = g6\/27re^^/^ where 
b is the percentile function of q, i.e. b is such that q = /^°° e-'^'/^/V^^. This 
value Stcrit is plotted as solid line in Fig. 1. For j = 2, Stent is smaller, 
related to the above by (Stcrit ^ Vl + Stent — 1 and plotted as dashed line 
in Fig. 1. For GEM (j = 00), the critical St is zero. For GMA-ES {j = 0), 
the critical St is infinite for q < 1/2. When the selection quantile q is above 
1/2, for all algorithms the critical St becomes zero. 

We conclude that, despite the principled approach of ascending the nat- 
ural gradient, the choice of the weighting function w, the choice of St, and 
possible choices in the update for St > 0, need to be taken with great care 
in relation to the choice of parametrization. 

4.3 Computing the IGO flow for some simple examples 

In this section we take a closer look at the IGO flow solutions of (6) for 
some simple examples of fitness functions, for which it is possible to obtain 
exact information about these IGO trajectories. 
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Figure 1: Critical 5t versus truncation quantile q. Above the critical 5t, the 
variance decreases systematically when optimizing a linear function. For 
CMA-ES/NES, the critical 6t for q < 0.5 is infinite. 

We start with the discrete search space X = {0, l}*^ and linear functions 
(to be minimized) defined as /(x) = c — J2i=i a-iXi with > 0. Note that 
the onemax function to be maximized /onemax(a;) = Yli=i is covered by 
setting = 1. The differential equation (6) for the Bernoulli measures 
Peix) = P0j^{xi) . . -Pe^ixii) defined on X can be derived taking the limit of 
the IGO-PBIL update given in (28): 

= j Wf,{x){x, - e\)Pet{dx) =: gi{d') . (36) 

Though finding the analytical solution of the differential equation (36) for 
any initial condition seems a bit intricate we show that the equation admits 
one critical stable point, (1, . . . , 1), and one critical unstable point (0, . . . , 0). 
In addition we prove that the trajectory decreases along / in the sense 
'^^^ ^ ^0. To do so we establish the following result: 

Lemma 17. On f{x) = c— X]f=i CiiXi the solution of (36) satisfies X]f=i ^x-i^ 
0; moreover Oi^^ = if md only if 9 = {0, ... ,0) or 9 = {1, ... ,1). 

Proof. We compute J2f=i aigi{9^) and find that 

i=l \i=l i=l / 

Wl{x)if{9')-f{x))Pgt{dx) 

= E[wlix)]nfix)] - E[wl{x)fix)] 

In addition, -W^ix) = -w(Piif{x') < f{x))) is a nondecreasing bounded 
function in the variable f{x) such that —Wgt{x) and f{x) are positively 
correlated (see [ThoOO, Chapter 1] for a proof of this result), i.e. 

E[-Wl{x)f{x)] ^ E[-Wl{x)]E[f{x)] 

with equality if and only if (9* = (0, ... , 0) or 6** = (1, ... , 1). Thus J2f=i ^ 
0. □ 
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The previous result implies that the positive definite function V{9) = 
Ei=i Oii - Ei=i aiOi in (1, ... , 1) satisfies V*{e) = VV{e) ■ g{e) ^ (such 
a function is called a Lyapunov function). Consequently (1, . . . , 1) is stable. 
Similarly ¥{6) = Y.f= I OiOi is a Lyapunov function for (0, . . . , 0) such that 
VV{9) • g{e) ^ 0. Consequently (0, ... ,0) is unstable [AO08]. 

We now consider on the family of multivariate normal distributions 
Pe = M{m, a^Id) with covariance matrix equal to cr^Id- The parameter 6 
thus has d + 1 components 6 = {m, a) G M'^ x M. The natural gradient 
update using this family was derived in [GSS+10]; from it we can derive the 
IGO differential equation which reads: 



dm* 



W/t(x)(x - m*)p_^(^t_(^i)27^)(x)dx 




(37) 



1 \ w/i(x)p_v(„t_(^t)2j^)(x)(ix (38) 



where cr* and o"* are linked via cr* = exp((T*) or a* = ln(c7*). Denoting TV 
a random vector following a centered multivariate normal distribution with 
covariance matrix identity we write equivalently the gradient flow as 

|2 



dt 
da* 



E 



d 



1 I wiAm^ + a^M) 



(39) 
(40) 



Let us analyze the solution of the previous system on linear functions. With- 
out loss of generality (because of invariance) we can consider the linear func- 
tion /(x) = xi. We have 

W^t{x) = u;(Pr(m*i + a^Zi < xi)) 

where Zi follows a standard normal distribution and thus 

wl{m^ + a*Af) = 



w{ Pr {Zi<Mi)) 



(41) 
(42) 



with the cumulative distribution of a standard normal distribution. The 
differential equation thus simplifies into 



dm* 
"dT 

do* 
"dF 







a 



\ 

i-E 

2d 



(|M|2-l)u;(^(AAi)) 



(43) 



(44) 



Consider, for example, the weight function associated with truncation 
selection, i.e. w^q) = Ig^go where qo €]0, 1] — also called intermediate recom- 
bination. We find that 



dm 



dt 

d(T* 

'dt 



a*E[Ml|^^^^-i =: a' (3 

i_ ( r 

2d 



T (n) dn — go a. 



(45) 
(46) 
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The solution of the IGO flow for the linear function /(x) = xi is thus 
given by 



The coefficient f3 is strictly negative for any qq < 1. The coefficient a is 
strictly positive if and only if go < 1/2 which corresponds to selecting less 
than half of the sampled points in an ES. In this case the step-size a* grows 
exponentially fast to infinity and the mean vectors moves along the gradient 
direction towards minus oo at the same rate. If more than half of the points 
are selected, qq ^ 1/2, the step-size will decrease to zero exponentially fast 
and the mean vector will get stuck (compare also [HanOGa]). 

5 Multimodal optimization using restricted Boltz- 
mann machines 

We now illustrate experimentally the influence of the natural gradient, versus 
vanilla gradient, on diversity over the course of optimization. We consider a 
very simple situation of a fitness function with two distant optima and test 
whether different algorithms are able to reach both optima simultaneously or 
only converge to one of them. This provides a practical test of Proposition 1 
stating that the natural gradient minimizes loss of diversity. 

The IGO method allows to build a natural search algorithm from an 
arbitrary probability distribution on an arbitrary search space. In partic- 
ular, by choosing families of probability distributions that are richer than 
Gaussian or Bernoulli, one may hope to be able to optimize functions with 
complex shapes. Here we study how this might help optimize multimodal 
functions. 

Both Gaussian distributions on M'^ and Bernoulli distributions on {0, 1}'^ 
are unimodal. So at any given time, a search algorithm using such distribu- 
tions concentrates around a given point in the search space, looking around 
that point (with some variance). It is an often-sought-after feature for an 
optimization algorithm to handle multiple hypotheses simultaneously. 

In this section we apply our method to a multimodal distribution on 
{0,1}'^: restricted Boltzmann machines (RBMs). Depending on the activa- 
tion state of the latent variables, values for various blocks of bits can be 
switched on or off, hence multimodality. So the optimization algorithm de- 
rived from these distributions will, hopefully, explore several distant zones of 
the search space at any given time. A related model (Boltzmann machines) 
was used in [Ber02] and was found to perform better than PBIL on some 
functions. 

Our study of a bimodal RBM distribution for the optimization of a 
bimodal function confirms that the natural gradient does indeed behave in 
a more natural way than the vanilla gradient: when initialized properly, 
the natural gradient is able to maintain diversity by fully using the RBM 
distribution to learn the two modes, while the vanilla gradient only converges 
to one of the two modes. 

Although these experiments support using a natural gradient approach, 
they also establish that complications can arise for estimating the inverse 
Fisher matrix in the case of complex distributions such as RBMs: estima- 
tion errors may lead to a singular or unreliable estimation of the Fisher 




cr* = o"" exp(at) . 



exp(at) 



(47) 
(48) 



29 




Figure 2: The RBM architecture witli tlie observed (x) and latent (h) vari- 
ables. In our experiments, a single hidden unit was used. 

matrix, especially when the distribution becomes singular. Further research 
is needed for a better understanding of this issue. 

The experiments reported here, and the fitness function used, are ex- 
tremely simple from the optimization viewpoint: both algorithms using the 
natural and vanilla gradient find an optimum in only a few steps. The empha- 
sis here is on the specific influence of replacing the vanilla gradient with the 
natural gradient, and the resulting influence on diversity and multimodality, 
in a simple situation. 

5.1 IGO for restricted Boltzmann machines 

Restricted Boltzmann machines. A restricted Boltzmann machine (RBM) 
[Smo86, AHS85] is a kind of probability distribution belonging to the fam- 
ily of undirected graphical models (also known as a Markov random fields). 
A set of observed variables x G {0, 1}"^ are given a probability using their 
joint distribution with unobserved latent variables h G {0,1}"'' [Gha04]. 
The latent variables are then marginalized over. See Figure 2 for the graph 
structure of a RBM. 

The probability associated with an observation x and latent variable h 
is given by 

Pg(x,h)= " ''I.Xh') ' Pe(x) = 5^Pe(x,h). (49) 
Z^x',h' e ^ ' ^ 

where -E(x, h) is the so-called energy function and J2x' h' ^ is the 

partition function denoted Z in Section 2.3. The energy E is defined by 

£'(x, h) = — ^ aiXi — ^ hjhj — ^ WijXihj . (50) 
i j i,j 

The distribution is fully parametrized by the bias on the observed vari- 
ables a, the bias on the latent variables b and the weights W which ac- 
count for pairwise interactions between observed and latent variables: 6 = 
(a,b,W). 

RBM distributions are a special case of exponential family distributions 
with latent variables (see (21) in Section 2.3). The RBM IGO equations 
stem from Equations (16), (17) and (18) by identifying the statistics T{x) 
with Xi, hj or Xihj. 

For these distributions, the gradient of the log-likelihood is well-known 
[Hin02] . Although it is often considered intractable in the context of machine 
learning where a lot of variables are required, it becomes tractable for smaller 
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RBMs. The gradient of the log-hkelihood consists of the difference of two 
expectations (compare (16)): 



din Pg{x) 



dw. 



The Fisher information matrix is given by (20) 



'■WabWa 



^{6) = ¥.[xahbXchd\ - ^[xahh]E[xchd] 



(51) 



(52) 



where IwabWcd denotes the entry of the Fisher matrix corresponding to the 
components Wab and Wcd of the parameter 9. These equations are understood 
to encompass the biases a and b by noticing that the bias can be replaced 
in the model by adding two variables Xk and hk always equal to one. 
Finally, the IGO update rule is taken from (14): 



fc=i 



(53) 



Implementation. In this experimental study, the IGO algorithm for RBMs 
is directly implemented from Equation (53). At each optimization step, the 
algorithm consists in (1) finding a reliable estimate of the Fisher matrix (see 
Eq. 52) which is then inverted using the QR-Algorithm if it is invertible; 
(2) computing an estimate of the vanilla gradient which is then weighted 
according to VF/; and (3) updating the parameters, taking into account the 
gradient step size 6t. In order to estimate both the Fisher matrix and the 
gradient, samples must be drawn from the model Pq. This is done using 
Gibbs sampling (see for instance [Hin02]). 

Fisher matrix imprecision. The imprecision incurred by the limited 
sampling size may sometimes lead to a singular estimation of the Fisher ma- 
trix (see p. 11 for a lower bound on the number of samples needed). Although 
having a singular Fisher estimation happens rarely in normal conditions, it 
occurs with certainty when the probabilities become too concentrated over 
a few points. This situation arises naturally when the algorithm is allowed 
to continue the optimization after the optimum has been reached. For this 
reason, in our experiments, we stop the optimization as soon as both optima 
have been sampled, thus preventing from becoming too concentrated. 

In a variant of the same problem, the Fisher estimation can become close 
to singular while still being numerically invertible. This situation leads 
to unreliable estimates and should therefore be avoided. To evaluate the 
reliability of the inverse Fisher matrix estimate, we use a cross-validation 
method: (1) making two estimates Fi and F2 of the Fisher matrix on two 
distinct sets of points generated from Pg, and (2) making sure that the 
eigenvalues of Fi x ^2"^ are close to 1. In practice, at all steps we check 
that the average of the eigenvalues of Fi x F2 is between 1/2 and 2. If at 
some point during the optimization the Fisher estimation becomes singular 
or unreliable according to this criterion, the corresponding run is stopped 
and considered to have failed. 

When optimizing on a larger space helps. A restricted Boltzmann 
machine defines naturally a distribution Pq on both visible and hidden units 
(x, h), whereas the function to optimize depends only on the visible units 
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X. Thus we are faced with a choice. A first possibihty is to decide that the 
objective function /(x) is really a function of (x, h) where h is a dummy 
variable; then we can use the IGO algorithm to optimize over (x, h) using 
the distributions P0{x,h.). A second possibility is to marginalize Pe(x, h) 
over the hidden units h as in (49), to define the distribution P6»(x); then we 
can use the IGO algorithm to optimize over x using Pg{x.). 

These two approaches yield slightly different algorithms. Both were 
tested and found to be viable. However the first approach is numerically 
more stable and requires less samples to estimate the Fisher matrix. In- 
deed, if Ii{9) is the Fisher matrix at 9 in the first approach and 12(9) in 
the second approach, we always have Ii{9) ^ l2{9) (in the sense of positive- 
definite matrices). This is because probability distributions on the pair (x, h) 
carry more information than their projections on x only, and so computed 
Kullback-Leibler distances will be larger. 

In particular, there are (isolated) values of 9 for which the Fisher matrix 
I2 is not invertible whereas Ii is. For this reason, we selected the first 
approach. 

5.2 Experimental results 

In our experiments, we look at the optimization of the two-min function 
defined below with a bimodal RBM: an REM with only one latent variable. 
Such an RBM is bimodal because it has two possible configurations of the 
latent variable: h = or h = 1, and given h, the observed variables are 
independent and distributed according to two Bernoulli distributions. 

Set a parameter y G {0, 1}"^. The two-min function is defined as follows: 

/y(x) = min \xi - yi\ |(1 - Xi) - yi)\^ (54) 

This function of x has two optima: one at y, the other at its binary comple- 
ment y. 

For the quantile rewriting of / (Section 1.2), we chose the function w to 
be w{q) = so that points which are below the median are given the 

weight 1, whereas other points are given the weight 0. Also, in accordance 
with (3), if several points have the same fitness value, their weig ht Wfj is set 
to the average of w over all those points. 

For initialization of the RBMs, the weights W are sampled from a normal 
distribution centered around zero and of standard deviation l/V^x ^ f^h-, 
where rix is the number of observed variables (dimension d of the problem) 
and rih is the number of latent variables {uh = 1 in our case), so that initially 
the energies E are not too large. Then the bias parameters are initialized as 

hj i Yli ^-nd ai i Yl,j +AA(^^) so that each variable (observed 

or latent) has a probability of activation close to 1/2. 

In the following experiments, we show the results of IGO optimization 
and vanilla gradient optimization for the two-min function in dimension 40, 
for various values of the step size 5t. For each we present the median of 
the quantity of interest over 100 runs. Error bars indicate the 16th percentile 
and the 84th percentile (this is the same as meanztstddev for a Gaussian 
variable, but is invariant by /-reparametrization). For each run, the param- 
eter y of the two-max function is sampled randomly in order to ensure that 
the presented results are not dependent on a particular choice of optima. 

The number of sample points used for estimating the Fisher matrix is 
set to 10, 000: large enough (by far) to ensure the stability of the estimates. 
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The same points are used for estimating the integral of (14), therefore there 
are 10, 000 cahs to the fitness function at each gradient step. These rather 
comfortable settings allow for a good illustration of the theoretical properties 
of the = oo IGO flow limit. 

The numbers of runs that fail after the occurrence of a singular matrix 
or an unreliable estimate amount for less than 10% for 5t ^ 2 (as little as 
3% for the smallest learning rate), but can increase up to 30% for higher 
learning rates. 

5.2.1 Convergence 

We first check that both vanilla and natural gradient are able to converge 
to an optimum. 

Figures 3 and 4 show the fitness of the best sampled point for the IGO 
algorithm and for the vanilla gradient at each step. Predictably, both algo- 
rithms are able to optimize the two-min function in a few steps. 

The two-min function is extremely simple from an optimization view- 
point; thus, convergence speed is not the main focus here, all the more since 
we use a large number of /-calls at each step. 

Note that the values of the parameter 6t for the two gradients used are 
not directly comparable from a theoretical viewpoint (they correspond to 
parametrizations of different trajectories in B-space, and identifying vanilla 
6t with natural 6t is meaningless). We selected larger values of 5t for the 
vanilla gradient, in order to obtain roughly comparable convergence speeds 
in practice. 

5.2.2 Diversity 

As we have seen, the two-min function is equally well optimized by the 
IGO and vanilla gradient optimization. However, the methods fare very 
differently when we look at the distance from the sample points to both 
optima. From (54), the fitness gives the distance of sample points to the 
closest optimum. We now study how close sample points come to the other, 
opposite optimum. The distance of sample points to the second optimum is 
shown in Figure 5 for IGO, and in Figure 6 for the vanilla gradient. 

Figure 5 shows that IGO also reaches the second optimum most of the 
time, and is often able to find it only a few steps after the first. This property 
of IGO is of course dependent on the initialization of the REM with enough 
diversity. When initialized properly so that each variable (observed and 
latent) has a probability 1 /2 of being equal to 1, the initial RBM distribution 
has maximal diversity over the search space and is at equal distance from 
the two optima of the function. From this starting position, the natural 
gradient is then able to increase the likelihood of the two optima at the 
same time. 

By stark contrast, the vanilla gradient is not able to go towards both 
optima at the same time as shown in Fig. 6. In fact, the vanilla gradient 
only converges to one optimum at the expense of the other. For all values of 
6t, the distance to the second optimum increases gradually and approaches 
the maximum possible distance. 

As mentioned earlier, each state of the latent variable h corresponds to 
a mode of the distribution. In Figures 7 and 8, we look at the average 
value of h for each gradient step. An average value close to 1/2 means 
that the distribution samples from both modes: h = or h = 1 with a 
comparable probability. Conversely, average values close to or 1 indicate 
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Figure 3: Fitness of sampled points during IGO optimization. 
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Figure 4: Fitness of sampled points during vanilla gradient optimization. 
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Figure 5: Distance to the second optimum during IGO optimization. 
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Figure 6: Distance to second optimum during vanilla gradient optimization. 
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that the distribution gives most probabihty to one mode at the expense of 
the other. In Figure 7, we can see that with IGO, the average value of h 
is close to 1/2 during the whole optimization procedure for most runs: the 
distribution is initialized with two modes and stays bimodal^. As for the 
vanilla gradient, statistics for h are depicted in Figure 8 and we can see that 
they converge to 1: one of the two modes of the distribution has been lost 
during optimization. 



Hidden breach of symmetry by the vanilla gradient. The experi- 
ments reveal a curious phenomenon: the vanilla gradient loses multimodal- 
ity by always setting the hidden variable /i to 1, not to 0. (We detected no 
obvious asymmetry on the visible units x, though.) 

Of course, exchanging the values and 1 for the hidden variables in a 
restricted Boltzmann machine still gives a distribution of another Boltzmann 
machine. More precisely, changing hj into 1 — hj is equivalent to resetting 

Oj + Wij, hj i bj, and Wij < Wij. IGO and the natural gradient 

are impervious to such a change by Proposition 9. 

The vanilla gradient implicitly relies on the Euclidean norm on parameter 
space, as explained in Section 1.1. For this norm, the distance between 
the RBM distributions {ai,bj,Wij) and {a[,bj,w^j) is simply j I '^^i ~ P + 
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J2j bj — b'j + J2ij ^ij ~ "^'ij ■ However, the change of variables Oj ^ 

cii + Wij, bj < bj, Wij —Wij does not preserve this Euclidean metric. 

Thus, exchanging and 1 for the hidden variables will result in two different 
vanilla gradient ascents. The observed asymmetry on /i is a consequence of 
this implicit asymmetry. 

The same asymmetry exists for the visible variables x^; but this does 
not prevent convergence to an optimum in our situation, since any gradient 
descent eventually reaches some local optimum. 

Of course it is possible to cook up parametrizations for which the vanilla 
gradient will be more symmetric: for instance, using —1/1 instead of 0/1 for 
the variables, or defining the energy by 

E{^, h) = -J:^M^^ -h)- EjBjih, - i) - E^,JW,,{X, - ^){h, - i) (55) 

with "bias-free" parameters Ai, Bj, Wij related to the usual parametrization 
by Wij = Wij , Qi = Ai — ^ Wij bj = Bj — ^ J2i Wij ■ The vanilla gradient 
might perform better in these parametrizations. 

However, we chose a naive approach: we used a family of probability 
distributions found in the literature, with the parametrization found in the 
literature. We then use the vanilla gradient and the natural gradient on 
these distributions. This directly illustrates the specific influence of the 
chosen gradient (the two implementations only differ by the inclusion of the 
Fisher matrix). It is remarkable, we think, that the natural gradient is able 
to recover symmetry where there was none. 



5.3 Convergence to the continuous-time limit 

In the previous figures, it looks like changing the parameter 5t only results 
in a time speedup of the plots. 

^In Fig. 7, we use the following adverse setting: runs are interrupted once they reach 
both optima, therefore the statistics are taken only over those runs which have not yet 
converged and reached both optima, which results in higher variation around 1/2. The 
plot has been stopped when less than half the runs remain. The error bars are relative 
only to the remaining runs. 
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Figure 7: Average value of h during IGO optimization. 
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Figure 8: Average value of h during vanilla gradient optimization. 
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Update rules of the type 6 9 + dtVgg (for either gradient) are Euler 
approximations of the continuous-time ordinary differential equation ^ = 
with each iteration corresponding to an increment 5t of the time t. 
Thus, it is expected that for small enough 5t, the algorithm after k steps 
approximates the IGO flow or vanilla gradient flow at time t = k.6t. 

Figures 9 and 10 illustrate this convergence: we show the fitness w.r.t to 
5t times the number of gradient steps. An asymptotic trajectory seems to 
emerge when 6t decreases. For the natural gradient, it can be interpreted 
as the fitness of samples of the continuous-time IGO fiow. 

Thus, for this kind of optimization algorithms, it makes theoretical 
sense to plot the results according to the "intrinsic time" of the underly- 
ing continuous-time object, to illustrate properties that do not depend on 
the setting of the parameter 6t. (Still, the raw number of steps is more 
directly related to algorithmic cost.) 

6 Further discussion 

A single framework for optimization on arbitrary spaces. A strength 
of the IGO viewpoint is to automatically provide optimization algorithms 
using any family of probability distributions on any given space, discrete or 
continuous. This has been illustrated with restricted Boltzmann machines. 
IGO algorithms also feature good invariance properties and make a least 
number of arbitrary choices. 

In particular, IGO unifies several well-known optimization algorithms 
into a single framework. For instance, to the best of our knowledge, PBIL 
has never been described as a natural gradient ascent in the literature®. 

For Gaussian measures, algorithms of the same form (14) had been devel- 
oped previously [HOOl, WSPS08] and their close relationship with a natural 
gradient ascent had been recognized [ANOKIO, GSS+10]. 

The wide applicability of natural gradient approaches seems not to be 
widely known in the optimization community, though see [MMS08]. 

About quantiles. The IGO fiow, to the best of our knowledge, has not 
been defined before. The introduction of the quantile-rewriting (3) of the 
objective function provides the first rigorous derivation of quantile- or rank- 
based natural optimization from a gradient ascent in 0-space. 

Indeed, NES and CMA-ES have been claimed to maximize — Ep^/ via 
natural gradient ascent [WSPS08, ANOKIO]. However, we have proved 
that when the number of samples is large and the step size is small, the 
NES and CMA-ES updates converge to the IGO flow, not to the similar 
flow with the gradient of Ep^/ (Theorem 4). So we find that in reality these 
algorithms maximize EpeW/i, where W/, is a decreasing transformation of 
the /-quantiles under the current sample distribution. 

Also in practice, maximizing —Kp^f is a rather unstable procedure and 
has been discouraged, see for example [Whi89]. 

About choice of Pg: learning a model of good points. The choice 
of the family of probability distributions Pg plays a double role. 

First, it is analogous to a mutation operator as seen in evolutionary algo- 
rithms: indeed, Pg encodes possible moves according to which new sample 
points are explored. 

^Thanks to Jonathan Shapiro for an early argument confirming this property (personal 
communication) . 
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Figure 9: Fitness of sampled points w.r.t. "intrinsic time" during IGO opti- 
mization. 
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Second, optimization algorithms using distributions can be interpreted 
as learning a probabilistic model of where the points with good values lie in 
the search space. With this point of view, Pg describes richness of this model: 
for instance, restricted Boltzmann machines with h hidden units can describe 
distributions with up to 2^ modes, whereas the Bernoulli distribution used 
in PBIL is unimodal. This influences, for instance, the ability to explore 
several valleys and optimize multimodal functions in a single run. 

Natural gradient and parametrization invariance. Central to IGO 
is the use of natural gradient, which follows from ^-invariance and makes 
sense on any search space, discrete or continuous. 

While the IGO flow is exactly ^-invariant, for any practical implementa- 
tion of an IGO algorithm, a parametrization choice has to be made. Still, 
since all IGO algorithms approximate the IGO flow, two parametrizations of 
IGO will differ less than two parametrizations of another algorithm (such as 
the vanilla gradient or the smoothed GEM method) — at least if the learning 
rate 6t is not too large. 

On the other hand, natural evolution strategies have never strived for 
0-invariance: the chosen parametrization (Gholesky, exponential) has been 
deemed a relevant feature. In the IGO framework, the chosen parametriza- 
tion becomes more relevant as the step size 6t increases. 

IGO, maximum likelihood and cross-entropy. The cross-entropy method 
(GEM) [ilUivAiKUo] can be used to produce optimization algorithms given 
a family of probability distributions on an arbitrary space, by performing a 
jump to a maximum likelihood estimate of the parameters. 

We have seen (Corollary 16) that the standard GEM is an IGO algorithm 
in a particular parametrization, with a learning rate 6t equal to 1. However, 
it is well-known, both theoretically and experimentally [I3LS07, HaiiOGb, 
WAS04], that standard GEM loses diversity too fast in many situations. 
The usual solution [dBKMR05] is to reduce the learning rate (smoothed 
GEM, (25)), but this breaks the reparametrization invariance. 

On the other hand, the IGO flow can be seen as a maximum likelihood 
update with infinitesimal learning rate (Theorem 13). This interpretation 
allows to define a particular IGO algorithm, the IGO-ML (Definition 14): it 
performs a maximum likelihood update with an arbitrary learning rate, and 
keeps the reparametrization invariance. It coincides with GEM when the 
learning rate is set to 1, but it differs from smoothed GEM by the exchange 
of the order of argmax and averaging (compare (23) and (25)). We argue 
that this new algorithm may be a better way to reduce the learning rate 
and achieve smoothing in GEM. 

Standard GEM can lose diversity, yet is a particular case of an IGO 
algorithm: this illustrates the fact that reasonable values of the learning 
rate St depend on the parametrization. We have studied this phenomenon 
in detail for various Gaussian IGO algorithms (Section 4.2). 

Why would a smaller learning rate perform better than a large one in 
an optimization setting? It might seem more efficient to jump directly to 
the maximum likelihood estimate of currently known good points, instead 
of performing a slow gradient ascent towards this maximum. 

However, optimization faces a "moving target", contrary to a learning 
setting in which the example distribution is often stationary. Currently 
known good points are likely not to indicate the position at which the op- 
timum lies, but, rather, the direction in which the optimum is to be found. 
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After an update, the next elite sample points are going to be located some- 
where new. So the goal is certainly not to settle down around these currently 
known points, as a maximum likelihood update does: by design, CEM only 
tries to reflect status-quo (even for A'' = oo), whereas IGO tries to move 
somewhere. When the target moves over time, a progressive gradient ascent 
is more reasonable than an immediate jump to a temporary optimum, and 
realizes a kind of time smoothing. 

This phenomenon is most clear when the number of sample points is 
small. Then, a full maximum likelihood update risks losing a lot of diver- 
sity; it may even produce a degenerate distribution if the number of sample 
points is smaller than the number of parameters of the distribution. On 
the other hand, for smaller 5t, the IGO algorithms do, by design, try to 
maintain diversity by moving as little as possible from the current distri- 
bution Pg in Kullback-Leibler divergence. A full ML update disregards the 
current distribution and tries to move as close as possible to the elite sample 
in Kullback-Leibler divergence [dBKMR05], thus realizing maximal diver- 
sity loss. This makes sense in a non-iterated scenario but is unsuited for 
optimization. 

Diversity and multiple optima. The IGO framework emphasizes the 
relation of natural gradient and diversity: we argued that IGO provides 
minimal diversity change for a given objective function increment. In par- 
ticular, provided the initial diversity is large, diversity is kept at a maxi- 
mum. This theoretical relationship has been confirmed experimentally for 
restricted Boltzmann machines. 

On the other hand, using the vanilla gradient does not lead to a balanced 
distribution between the two optima in our experiments. Using the vanilla 
gradient introduces hidden arbitrary choices between those points (more ex- 
actly between moves in B-space). This results in loss of diversity, and might 
also be detrimental at later stages in the optimization. This may reflect the 
fact that the Euclidean metric on the space of parameters, implicitly used in 
the vanilla gradient, becomes less and less meaningful for gradient descent 
on complex distributions. 

IGO and the natural gradient are certainly relevant to the well-known 
problem of exploration-exploitation balance: as we have seen, arguably the 
natural gradient realizes the best increment of the objective function with 
the least possible change of diversity in the population. 

More generally, IGO tries to learn a model of where the good points 
are, representing all good points seen so far rather than focusing only on 
some good points; this is typical of machine learning, one of the contexts 
for which the natural gradient was studied. The conceptual relationship of 
IGO and IGO-like optimization algorithms with machine learning is still to 
be explored and exploited. 

Summary and conclusion 

We sum up: 

• The information-geometric optimization (IGO) framework derives from 
invariance principles and allows to build optimization algorithms from 
any family of distributions on any search space. In some instances 
(Gaussian distributions on or Bernoulli distributions on {0, 1}*^) it 
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recovers versions of known algorithms (CMA-ES or PBIL); in other in- 
stances (restricted Boltzmann machine distributions) it produces new, 
hopefully efficient optimization algorithms. 

• The use of a quantile-based, time-dependent transform of the objective 
function (Equation (3)) provides a rigorous derivation of rank-based 
update rules currently used in optimization algorithms. Theorem 4 
uniquely identifies the infinite-population limit of these update rules. 

• The IGO flow is singled out by its equivalent description as an in- 
finitesimal weighted maximal log-likelihood update (Theorem 13). In 
a particular parametrization and with a step size of 1, it recovers the 
cross-entropy method (Corollary 16). 

• Theoretical arguments suggest that the IGO flow minimizes the change 
of diversity in the course of optimization. In particular, starting with 
high diversity and using multimodal distributions may allow simulta- 
neous exploration of multiple optima of the objective function. Pre- 
liminary experiments with restricted Boltzmann machines confirm this 
effect in a simple situation. 

Thus, the IGO framework is an attempt to provide sound theoretical 
foundations to optimization algorithms based on probability distributions. 
In particular, this viewpoint helps to bridge the gap between continuous and 
discrete optimization. 

The invariance properties, which reduce the number of arbitrary choices, 
together with the relationship between natural gradient and diversity, may 
contribute to a theoretical explanation of the good practical performance of 
those currently used algorithms, such as CMA-ES, which can be interpreted 
as instantiations of IGO. 

We hope that invariance properties will acquire in computer science 
the importance they have in mathematics, where intrinsic thinking is the 
first step for abstract linear algebra or differential geometry, and in modern 
physics, where the notions of invariance w.r.t. the coordinate system and 
so-called gauge invariance play central roles. 
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Appendix: Proofs 

Proof of Theorem 4 (Convergence of empirical means and 
quantiles) 

Let us give a more precise statement including the necessary regularity con- 
ditions. 
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Proposition 18. Let 9 E @. Assume that the derivative ^^^gg^^^ exists 

< +00. Assume that the 



dluPgjx) 

de 



for Pg-almost all x ^ X and that Ep^ 

function w is non- decreasing and hounded. 

Let {xijiiz^ he a sequence of independent samples of Pq. Then with prob- 
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where 



w/^ X / rkjv(xO + 1/2 
W^{xi) = w ( 



with rkjv(a;i) = #{1 ^ j ^ N, f{xj) < f(xi)}. (When there are f-ties in the 
sample, Wf{xi) is defined as the average of w{{r + 1/2) /N) over the possible 
rankings r of xi .) 

Proof. Let g' : X — )■ M be any function with Ep^^^ < 00. We will show that 
I] {xi)g{xi) — > / W0{x)g{x) ^^(dx). Applying this with g equal to the 

components of ^ '"^^ ^'^^ wih yield the result. 
Let us decompose 

i E Wf{xi)g{xi) = 1 ^ W^{xi)g{xi) + ^ Y.^Wi{xi) - {xi))g{xi). 
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Each summand in the first term involves only one sample Xi (contrary to 
{xi) which depends on the whole sample). So by the strong law of large 

numbers, almost surely ^/(2^i)5'(^i) converges to j Wq {x)g{x) PQ{dx). 

So we have to show that the second term converges to almost surely. 
By the Cauchy-Schwarz inequality, we have 

By the strong law of large numbers, the second term jjYliQixi)^ converges to 



Ep^g^ almost surely. So we have to prove that the first term J2{^H 
Wg (xj))^ converges to almost surely. 

Since w is bounded by assumption, we can write 



Xi 



{Wf (xi) - Wl{xi)f ^ 2B\wf{xi) - Wiixi) 
= 2B W^ixi) - Wilixi) 



+ 2B 



Wf{xi) - Wf{xi) 



where B is the bound on \ w\. We will bound each of these terms. 

Let us abbreviate = Pr^/^p^(/(x') < /(xj)), qf = Vv^:^pg{f{x') ^ 
/(xO), rr = #{i^iV, /(xj) < /(xi)}, r+ = #{j^N, /(x,) ^ /(x^)}. 

By definition of we have 
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and moreover Wq (xi) = w{q^ ) if = q'l or Wgixi) 
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wise. 



The Glivenko-Cantelli theorem states that supj qf — r^/N 



tends to 



almost surely, and likewise for supj 
so that these errors are bounded by e. 



r^ /N . So let A/" be large enough 
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Since w is non-increasing, we have w{q~) ^ w{r^ /N — e). In the case 
l7 7^ it 1 decompose the interval [q[]qf] into {rf — r~) subintervals. 
The average of w over each such subinterval is compared to a term in the 
sum defining {xi): since w is non- increasing, the average of w over the 
k^^ subinterval is at most w{{r~ + k)/N — e). So we get 



r- — r- 

« « k=r- 



so that 



- (xi) ^ ^ ^ ^ (ii;(A;/iV - e) - ?i;((A; + l/2)/iV)). 

Let us sum over i, remembering that there are {rf — r^) values of j for 
which f{xj) = f{xi). Taking the positive part, we get 

, N , Af-l 

^ E |^/(^*) - "^N^ ^"^^^'^ "^^^^ + V2)/iV)). 

■t=l fc=0 



Since is non-increasing we have 

1 N-l „i_e_i/Ar 

and 

1 -^-1 j-l+l/2N 

N Jl/2N 

(we implicitly extend the range of w so that w{q) = 'w{0) for q < 0). So we 
have 



w 



1 A I f — . /"^/^Af /-l+l/2Af 

^ + J-e-l/N Jl-e-l/N 

where B is the bound on Ifwl 



~[ I + J-e-l/N Jl-e-l/N 



Reasoning symmetrically with w{k/N -\- e) and the inequalities reversed, 
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we get a similar bound for jfY^ — ^^•'^(xj) . This ends the proof. 



□ 



Proof of Proposition 5 (Quantile improvement) 

Let us use the weight w{u) = lu^q- Let m be the value of the g-quantile 
of / under Pqi. We want to show that the value of the g-quantile of / 
under Pgt+st is less than m, unless the gradient vanishes and the IGO flow 
is stationary. 

Let p_ = Pri:^p^,(/(x) < m), pm = VT^^p^^{f{x) = m) and p+ = 
Pi'x~Pgt if{x) > m). By definition of the quantile value we have p_ +Pm ^ Q 
and p+ + pm ^ 1 — <?• Let us assume that we are in the more complicated 
case Pm 7^ (for the = 0, simply remove the corresponding terms). 

We have Wq^{x) = 1 if f{x) < m, Wg^{x) = if f{x) > m and Wq^{x) = 
— w{u)du = if f{x) = m.' 

Pm ■'P- ^ ■' Pm J \ ■' 

Using the same notation as above, let gt{0) = J W^tix) Peidx). Decom- 
posing this integral on the three sets f{x) < m, f{x) = m and f{x) > m, we 
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get that gt{0) = FT^^Pg{f{x) < m) + Pr^^Pg(/(x) = m)^^. In particular, 
9t{0') =q. 

Since we follow a gradient ascent of gt, for 5t small enough we have 
gt{0^'^^) > gt{S^) unless the gradient vanishes. If the gradient vanishes we 
have 0*+''* = 0^ and the quantiles are the same. Otherwise we get gt{0^^^) > 
9t{0') = q. 

Since ^ ^^'^l'::^"'' = 1> ^e have gt{e) ^ Pr,.^p,(/(x) < m) + 

^"^x^PeUix) =m)= VT^^Pg{f{x) ^ m). 

So Pra;^p^^^^^ (/(x) ^ m) ^ gt{G^^^^) > Q- This implies that, by defini- 
tion, the g-quantile value of Pgi+u is smaller than m. 

Proof of Proposition 10 (Speed of the IGO flow) 

Lemma 19. Let X he a centered random variable with values in and 
let A he a real-valued LP' random variable. Then 

||E(^X)|| < VAVar^ 

where A is the largest eigenvalue of the covariance matrix of X expressed in 
an orthonormal basis. 

Proof of the lemma. Let v be any vector in W^; its norm satisfies 

||f II = sup {v ■ w) 

w, ||ui||$;i 

and in particular 

||E(AX)|| = sup {wE{AX)) 

w, ||ui||s;i 

= sup E{A{wX)) 
w, ||ui||s;i 

= sup K{{A — KA) {w • X)) since {w • X) is centered 

w, ||ui||^;i 

^ sup 

w, ||ui||^;i 

by the Cauchy-Schwarz inequality and using the fact that A is centered. 
Now, in an orthonormal basis we have 

{wX) = J2wiXi 

i 

SO that 

Eiiw ■ xf) = E {iE^mX,)iJ:JW,x,)) 

= J:iEJnw^Xiw,XJ) 
= J:iJ:Jw^w,Eix,XJ) 

with Cij the covariance matrix of X. The latter expression is the scalar 
product (w ■ Cw). Since C is a symmetric positive-semidefinite matrix, [w ■ 
Cw) is at most AHf^lp with A the largest eigenvalue of C. □ 

For the IGO flow we have ^ = E^^p,w/(x)V0 In Pg{x). 
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So applying the lemma, we get that the norm of ^ is at most y A Var^^^p^ W/j 
where A is the largest eivengalue of the covariance matrix of lnPg{x) (ex- 
pressed in a coordinate system where the Fisher matrix at the current point 
9 is the identity). 



By construction of the quantiles, we have Var^j^p^ Wg{x) ^ Var^^i 



w 



(with equality unless there are ties). Indeed, for a given x, let U he a 
uniform random variable in [0, 1] independent from x and define the ran- 
dom variable Q = q~{x) + {q^{x) — q~{x))U. Then Q is uniformly dis- 
tributed between the upper and lower quantiles q^{x) and q~{x) and thus 
we can rewrite Wq{x) as 'E{w{Q)\x). By the Jensen inequality we have 
VarW/(x) = Var E(w((3)|x) ^ \aiw{Q). In addition when x is taken un- 
der Pq, Q is uniformly distributed in [0, 1] and thus Vartt;(Q) = Varp^i] w, 



i.e. Yaixr^Pg W/(x) ^ Var[o,i] 

Besides, consider the tangent space in 0-space at point 0*, and let us 
choose an orthonormal basis in this tangent space for the Fisher metric. 
Then, in this basis we have Vi\nPg{x) = diln Pg{x). So the covariance 
matrix of VlnPg{x) is K x Pg {di In Pg{x)dj In Pg{x)), which is equal to the 
Fisher matrix by definition. So this covariance matrix is the identity, whose 
largest eigenvalue is 1. 

Proof of Proposition 12 (Noisy IGO) 

On the one hand, let Pg be a family of distributions on X. The IGO algo- 
rithm (13) applied to a random function /(x) = /(x, uj) where cj is a random 
variable uniformly distributed in [0, 1] reads 

N 
i=l 

where Xj ~ Pg and Wi is according to (12) where ranking is applied to the 
values f{xi,uji), with Wj uniform variables in [0, 1] independent from Xj and 
from each other. 

On the other hand, for the IGO algorithm using Pg f^[o,i] a-^id applied 
to the deterministic function /, Wi is computed using the ranking according 
to the / values of the sampled points Xj = (xj, Wj), and thus coincides with 
the one in (56). 

Besides, 



w. 



Vg In Pemio.i] i^i) = "^e (xj) + Vg In [/[o,i] 



=0 

and thus the IGO algorithm update on space X x [0, 1], using the family 
of distributions Pg = Pg f^[o,i]) applied to the deterministic function /, 
coincides with (56). 

Proof of Theorem 13 (Natural gradient as ML with infinitesi- 
mal weights) 

We begin with a calculus lemma (proof omitted). 

Lemma 20. Let f be real-valued function on a finite- dimensional vector 
space E equipped with a definite positive quadratic form \\ ■ |p. Assume f is 
smooth and has at most quadratic growth at infinity. Then, for any x £ E, 
we have 

\7f{x) = lim — argmax < /(x -\- h) 
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where V is the gradient associated with the norm \\ ■ \\. Equivalently, 

argmax|/(y) - ^||y - | = x + eVf{x) + 0{e^) 
when e — )■ 0. 

We are now ready to prove Theorem 13. Let be a function of x, and 
fix some 9q \n Q. 

We need some regularity assumptions: we assume that the parameter 
space G is non-degenerate (no two points ^ G define the same probabiHty 
distribution) and proper (the map Pg ^ 9 \s continuous). We also assume 
that the map 9 ^ Pq is smooth enough, so that / log Pe{x) W{x) Pq^ (dx) is 
a smooth function of 9. (These are restrictions on ^-regularity: this does 
not mean that W has to be continuous as a function of x.) 

The two statements of Theorem 13 using a sum and an integral have 
similar proofs, so we only include the first. For e > 0, let 6 be the solution 
of 

e = argmax|(l-e) J logPe(x) ^^^(dx) + e J logPe(x) VF(x) Pe„(dx)|. 
Then we have 

e = argmax| J log Pg{x) Pe,{dx) + e J logP0(x) {W{x) - 1) Pe^idx)^ 
= arg max | J log ^^(x) Pq^ (dx) - J log Pq^ (x) Pg^ (dx) + e j log Pe{x) {W{x) - 1) P^o (dx) 

(because the added term does not depend on 9) 
= arg max | - KL{Pe^ \\ Pe) + e j logPe{x) {W{x) - 1) P0o(dx)| 

= argmax|-^KL(Peoim) + / log Pe{x) {W {x) - 1) Pe^{dx)^ 

When e — )■ 0, the first term exceeds the second one if 'KL{Pqq \\ Pe) is too 
large (because W is bounded), and so KL(PeQ || Pg) tends to 0. So we can 
assume that 9 is close to ^o- 

When 9 = 9o + 69 is close to 9o, we have 

KL(Peo II ^e) = ^ E ^ii(^o) S9i 59, + 0{S9^) 

with Iij{9o) the Fisher matrix at ^o- (This actually holds both for KL{Pg^^ \\ Pg) 
and KL{Pe\\Pg^).) 

Thus, we can apply the lemma above using the Fisher metric J2 ^ij (^o) ^^i ^^j j 
and working on a small neighborhood of 9q in 0-space (which can be iden- 
tified with M'^™®). The lemma states that the argmax above is attained 
at 

9 = 9o + eVeJ log Pg{x) {W{x) - 1) P,,(dx) 

up to O(e^), with V the natural gradient. 

Finally, the gradient cancels the constant —1 because / (V log Pg) Pg^ = 
at 9 = 9q. This proves Theorem 13. 
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Proof of Theorem 15 (IGO, GEM and IGO-ML) 

Let Pg be a family of probability distributions of the form 



Peix) = ^exp (^^.r,(x)) H{dx) 



m 

where Ti, . . . , is a finite family of functions on X and H{dx) is some ref- 
erence measure on X. We assume that the family of functions (Tj)j together 
with the constant function Tq{x) = 1, are linearly independent. This pre- 
vents redundant parametrizations where two values of 9 describe the same 
distribution; this also ensures that the Fisher matrix Cov(Ti,Tj) is invert- 
ible. 

The IGO update (14) in the parametrization Tj is a sum of terms of the 
form 

Vf\nP{x). 

So we will compute the natural gradient V^^. in those coordinates. We first 
need some general results about the Fisher metric for exponential families. 

The next proposition gives the expression of the Fisher scalar product 
between two tangent vectors 5P and 5' P of a statistical manifold of ex- 
ponential distributions. It is one way to express the duality between the 
coordinates 6i and Tj (compare [ANOO, (3.30) and Section 3.5]). 

Proposition 21. Let 59i and 5'9i be two small variations of the parameters 
9i. Let 6P{x) and 5'P(x) be the resulting variations of the probability dis- 
tribution P, and 6Ti and 6'Ti the resulting variations ofTi. Then the scalar 
product, in Fisher information metric, between the tangent vectors 5P and 
5'P, is 

{6P, 5'P) = 69i S'Ti = J2 ^'^i ^Ti. 

i i 

Proof. By definition of the Fisher metric: 
{SP,5'P) = Y,Lij59i5'9j 

id 



\-^f)S'9 / glnP(x) dlnPjx) 

^ainP(x) ^ dlnPjx) , 

J 



j Y.{Ti{x)-fi)59^5'P{x) by (16) 
•^^ i 

^ 59i (^J^ T,{x) 5'P{x)^ - ^ 59i% ^ 5'P{x) 



= Y,59i5% 

i 

because 5'P{x) = since the total mass of P is 1, and Ti{x) 5' P{x) = 
6'Ti by definition of T, . □ 
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Proposition 22. Let f be a function on the statistical manifold of an expo- 
nential family as above. Then the components of the natural gradient w.r.t. 
the expectation parameters are given by the vanilla gradient w.r.t. the natural 
parameters: 

dl 



and conversely 



(Beware this does not mean that the gradient ascent in any of those 
parametrizations is the vanilla gradient ascent.) 

We could not find a reference for this result, though we think it known. 

Proof. By definition, the natural gradient V/ of a function / is the unique 
tangent vector 5P such that that, for any other tangent vector 6'P, we have 

6'f = {6P,6'P) 

with (•, •) the scalar product associated with the Fisher metric. We want to 
compute this natural gradient in coordinates Tj, so we are interested in the 
variations 5Ti associated with 6P. 

By Proposition 21, the scalar product above is 

{SP,S'P) = Y,6fi5'ei 

where 5Ti is the variation of Tj associated with 5P, and 5'9i the variation of 
6i associated with S'P. 

On the other hand we have 6' f = J2i ^ ^'^i- So we must have 



for any S'P, which leads to 

as needed. The converse relation is proved mutatis mutandis. □ 

Back to the proof of Theorem 15. We can now compute the desired 
terms: 

by (16). This proves the first statement (27) in Theorem 15 about the form 
of the IGO update in these parameters. 

The other statements follow easily from this together with the additional 
fact (26) that, for any set of weights aj with J^^i — 1; the value T* = 
J2i ci{i)T{xi) is the maximum likelihood estimate of o,{i) logP(xi). 
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